PROJECT REPORT


The ISMIS Symposium was organized by UNC-Charlotte at the University
Hilton Hotel in Charlotte, North Carolina, October 11-14, 2000.

We had 7 invited talks and 59 regular presentations. A special session on
Evolutionary Computation (5 papers) was organized by Z. Michalewicz on
the second day of the Symposium. A total of 112 papers were submitted from
the universities and research institutes represented by ISMIS'00 Program
Committee members.

The ISMIS'00 proceedings are published by Springer-Verlag in the LNCS/LNAI
series (No. 1932), with Zbigniew W. Ras (UNC-Charlotte) and Setsuo
Ohsuga (Waseda U., Japan) as co-editors.

In addition, an ISMIS'00 Special Issue (about 200 pages long) containing extended
versions of 10 papers presented at the ISMIS'00 Symposium will appear in
the journal Fundamenta Informaticae (published by IOS Press) sometime this
year. Zbigniew Ras and Essam El-Kwae are its guest editors.

Papers from the following seven areas were presented at ISMIS'00:
evolutionary computation, intelligent information retrieval, intelligent
information systems, knowledge representation, knowledge discovery and
learning, logic for artificial intelligence, and methodologies.

The ISMIS'00 invited speakers and the titles of their talks are listed below:

• Jaime Carbonell (CMU)

"A Machine Learning Perspective on Information Filtering"

• Bruce Croft (UM-Amherst)

"Information Systems Based on Statistical Language Models"

• Philip Emmerman (Army Research Lab)

"Intelligent Agent Battlespace Augmentation"

• Bill Harris (Bank of America)

"Architecting to Meet Customer Need"

• Ryszard Michalski (GMU)

"Learning and Evolution"

• Jeff Scott (First Union)

"The Intelligent Business" (Banquet Talk)

• Brian Bachman (NCR)

"SQL-based Data Mining Applications" (Software Presentation)
Thanks to the ARO grant, the ISMIS'00 Organizing Committee was able to
reduce the early registration fee from $330 to $320 (and the late
registration fee from $380 to $370) for all ISMIS'00 participants and also
to give 6 free registration awards ($1,320 in total) to Ph.D. in Information
Technology students from UNC-Charlotte. Their names are listed below:
Shishir Gupta, Vincent Osisek, Dongwan Shin, Pamela Thompson, Neal
Wagner, Xiaochen Zhao.

The ARO grant was used to pay the registration fees ($990 in total) and the
travel and accommodation expenses at the University Hilton for three invited
speakers: Jaime Carbonell, Bruce Croft, and Ryszard Michalski.

We also used the ARO grant to pay part of the ISMIS'00 mailing and printing
costs.

We  had  82  participants  at  the  symposium. 
Preface


This volume contains the papers selected for presentation at the Twelfth
International Symposium on Methodologies for Intelligent Systems - ISMIS 2000,
held in Charlotte, N.C., 11-14 October, 2000. The symposium was co-organized
by the College of Information Technology at UNC-Charlotte and the
Polish-Japanese Institute of Information Technology. It was sponsored by the US Army
Research Office, NCR Data Mining Laboratory, College of IT at UNC-Charlotte,
and others.

ISMIS  is  a  conference  series  that  was  started  in  1986  in  Knoxville,  Tennessee. 
Since  then  it  has  been  held  in  Charlotte  (North  Carolina),  Knoxville  (Tennessee), 
Torino  (Italy),  Trondheim  (Norway),  Warsaw  (Poland),  and  Zakopane  (Poland). 

The program committee selected the following major areas for ISMIS 2000:
Evolutionary Computation, Intelligent Information Retrieval, Intelligent
Information Systems, Knowledge Representation and Integration, Knowledge
Discovery and Learning, Logic for Artificial Intelligence, and Methodologies.

The contributed papers were selected from 112 full draft papers by the
following program committee: A. Biermann, P. Bose, J. Calmet, S. Carberry, N.
Cercone, J. Chen, W. Chu, B. Croft, J. Debenham, S.M. Deen, K. DeJong, R.
Demolombe, B. Desai, T. Elomaa, F. Esposito, A. Giordana, J. Grzymala-Busse,
M. Hadzikadic, H. Hamilton, D. Hislop, K. Hori, W. Kloesgen, Y. Kodratoff, J.
Komorowski, J. Koronacki, W. Kosinski, R. Kostoff, B.G.T. Lowden, D. Maluf,
D. Malerba, R.L. de Mantaras, S. Matwin, R. Meersman, Z. Michalewicz, R.
Michalski, R. Mizoguchi, M. Mukaidono, L. De Raedt, V. Raghavan, E. Rosenthal,
L. Saitta, A. Skowron, V.S. Subrahmanian, S. Tsumoto, T. Yamaguchi, G.P.
Zarri, M. Zemankova, N. Zhong, J.M. Zytkow. Additionally, we acknowledge the
help in reviewing papers from: E. El-Kwae, S. Ferilli, M. Kryszkiewicz, and A.
Wieczorkowska.

We wish to express our thanks to Jaime Carbonell, Bruce Croft, Philip
Emmerman, Bill Harris, and Ryszard Michalski, who presented invited talks at the
symposium. Also, we are thankful to Zbigniew Michalewicz for organizing the
Special Session on Evolutionary Computation. We express our appreciation to
the sponsors of the symposium and to all who submitted papers for
presentation and publication in the proceedings. Our sincere thanks go to the
Organizing Committee of ISMIS 2000. Also, our thanks are due to Alfred Hofmann of
Springer-Verlag for his continuous help and support.


August  2000 


Zbigniew  W.  Ras 
Setsuo  Ohsuga 
Abstract. The amount of on-line information is growing exponentially.
Much of this information is unstructured and language-based. To deal
with this flood of information, a number of tools and language
technologies have been developed. Progress has been made in areas such as
information retrieval, information extraction, filtering, speech recognition,
machine translation, and data mining. Other more specific areas such
as cross-lingual retrieval, summarization, categorization, distributed
retrieval, and topic detection and tracking are also contributing to the
proliferation of technologies for managing information. Currently these
tools are based on many different approaches, both formal and ad hoc.
Integrating them is very difficult, yet this will be a critical part of
building effective information systems in the future. In this paper, we discuss
an approach to providing a framework for integration based on language
models.


1  Introduction 

Tools for managing language-based information have become essential
components of modern information systems. This type of information, in the form of
unstructured or semi-structured text (e.g. HTML or XML), is found throughout
the applications that are driving our economy. In addition, the increase in the
use of speech input and data, OCR, and metadata descriptions of images and
video, has resulted in text becoming a lingua franca for information systems.
Although considerable progress has been made with language-based tools such
as information retrieval, filtering, categorization, extraction, summarization, and
mining, their performance is unreliable and the effects of integrating them are
unpredictable. One of the major reasons for this is the lack of a unifying formal
framework for developing and combining language-based technologies. Instead,
the tools are based on many different approaches and theories, often implicit
and sometimes ad hoc. If a single unifying framework and architecture for
information management could be created, it would enable the development of
significantly more effective tools, support integration, and substantially advance
our understanding of the processes underlying information access and
organization. A growing number of researchers believe that such a framework can be
based on statistical language models.
The language modeling approach has been applied, with considerable success,
to speech recognition and machine translation [10, 4, 3]. More recently, there
have been breakthroughs in applying this approach to information retrieval and
extraction [16, 15, 1, 12, 22, 2].

The use of language models is attractive for several reasons. Building an
information system using language models allows us to reason about the design
and empirical performance of the system in a principled way, using the tools
of probability theory. In addition, we can leverage the work that has been
carried out in the speech recognition community in the past thirty years on such
problems as smoothing and combining language models for multiple topics and
collections. The language modeling approach applies naturally to a wide range
of information system technologies, such as distributed retrieval, cross-language
IR, summarization and filtering.

Much remains to be done to establish language modeling as a unifying
framework. We need to show how language models can represent documents, topics,
databases, languages, queries, and even people. We need to develop efficient
algorithms for acquiring, comparing, and summarizing language models of different
types and granularities. We need to show how statistical language models can
describe the crucial functions in a language-based information system, such as
information retrieval, filtering, and summarization. Finally, we need to
demonstrate that the performance of the language-based functions improves as a result
of using a language modeling framework. Figure 1 gives an overview of the
representation and function aspects of the language model framework.


[Figure 1. Labels recovered from the diagram: "Representation" (Topics,
Documents, Collections, Languages, Queries, People); "Language Models";
"Function" (IR, Distributed IR, Translingual IR, Speech, TDT, MT,
Summarization, Organization, Extraction, Mining, Filtering).]

Fig. 1. Overview of Language Model Framework
A project addressing these issues has begun as a collaboration between
researchers at the University of Massachusetts and Carnegie Mellon University.

In this paper, I will focus on a more detailed description of the issues involved
in applying language models to information retrieval. The next section describes
how language models can transform the view of document representations and
indexing models. Section 3 discusses how language modeling approaches to IR
treat relevance, that is, how the concept of relevance is incorporated
in the overall retrieval model. Section 4 shows how combination of evidence or
data fusion can be implemented using language models.

2  Language  Models  and  Indexing 

Over  the  past  three  decades,  probabilistic  models  of  document  retrieval  have 
been  studied  extensively.  In  general,  these  approaches  can  be  characterized  as 
methods  of  estimating  the  probability  of  relevance  of  documents  to  user  queries. 
One  component  of  a  probabilistic  retrieval  model  is  the  indexing  model,  i.e.,  a 
model  of  the  assignment  of  indexing  terms  to  documents. 

A well-known example of an indexing model is the 2-Poisson model [8]. The
success of the 2-Poisson model has been somewhat limited, but it should be
noted that Robertson's tf weight, which has been quite successful, was intended
to behave similarly to the 2-Poisson model [18]. Other probabilistic indexing
models have also been proposed (e.g. [7]).

Estimating the probability that an index term is "correct" for a document
is difficult. As a result, heuristic tf.idf weights are used in the retrieval algorithms
based on these models. In order to avoid these weights and the awkwardness of
modeling the correctness of indexing, Ponte and Croft [16] proposed a language
modeling approach to information retrieval. The phrase "language model" is used
by the speech recognition community to refer to a probability distribution that
captures the statistical regularities of the generation of language [10]. Generally
speaking, language models for speech attempt to predict the probability of the
next word in an ordered sequence. For the purposes of document retrieval, Ponte
and Croft modeled occurrences at the document level without regard to
sequential effects, although they showed that it is possible to model local predictive
effects for features such as phrases. Mittendorf and Schäuble [13] used a
similar approach to construct a generative model for retrieval based on document
passages.

The approach to retrieval described in Ponte and Croft [16] is to infer a
language model for each document and to estimate the probability of generating
the query according to each of these models. Documents are then ranked
according to these probabilities. In this approach, collection statistics such as term
frequency, document length and document frequency are integral parts of the
language model and do not have to be included in an ad hoc manner. The score
for a document in the simple unigram model used in Ponte and Croft is given
by:

$$P(Q \mid D) = \prod_{w \in Q} P(w \mid D) \prod_{w \notin Q} \bigl(1 - P(w \mid D)\bigr)$$
where P(Q|D) is the estimate of the probability that a query can be generated
for a particular document, and P(w|D) is the probability of generating a word
given a particular document (the language model).

Much of the power of this simple model comes from the estimation techniques
used for these probabilities, which combine both maximum likelihood estimates
and background models. This part of the model benefits directly from the
extensive research done on estimation of language models in fields such as speech
recognition and machine translation. More sophisticated models that make use
of bigram and even trigram probabilities are currently being investigated [12, 19].
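
As a minimal sketch of the unigram score, log P(Q|D) = sum over query words of log P(w|D) plus sum over non-query vocabulary words of log(1 - P(w|D)), the following Python fragment ranks two toy documents. The smoothing used here (linear interpolation of the maximum likelihood estimate with a collection-wide background model, with an assumed weight `lam`) is only one simple way to combine the two kinds of estimate; it is not the estimator of Ponte and Croft [16]. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def unigram_model(doc_tokens, collection_counts, collection_len, lam=0.5):
    """Return a function w -> P(w|D), smoothed with a background model.

    Assumption: linear interpolation with weight `lam`; this is a
    stand-in for the estimator used in the original work.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def prob(w):
        p_ml = counts[w] / doc_len if doc_len else 0.0
        p_bg = collection_counts[w] / collection_len
        return lam * p_ml + (1.0 - lam) * p_bg

    return prob


def log_query_likelihood(query_tokens, prob, vocabulary):
    """Sum log P(w|D) over query words and log(1 - P(w|D)) over the rest."""
    query_terms = set(query_tokens)
    score = 0.0
    for w in vocabulary:
        p = prob(w)
        score += math.log(p) if w in query_terms else math.log(1.0 - p)
    return score


# Toy collection (made up for illustration).
docs = {
    "d1": "star wars film hollywood film".split(),
    "d2": "missile defense system star wars".split(),
}
collection_counts = Counter(w for toks in docs.values() for w in toks)
collection_len = sum(collection_counts.values())
vocabulary = sorted(collection_counts)


def rank(query_tokens):
    """Return document ids ordered by decreasing log P(Q|D)."""
    scores = {
        name: log_query_likelihood(
            query_tokens,
            unigram_model(toks, collection_counts, collection_len),
            vocabulary,
        )
        for name, toks in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank(["hollywood", "film"]))
print(rank(["missile", "defense"]))
```

Working in log space avoids numerical underflow when the vocabulary is large; query words outside the collection vocabulary are simply ignored in this sketch.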


The idea of a language model representing the text written in specific
documents leads directly to the possibility of using language models to represent
topics in domains and users' views of domains. Establishing a context for the
query is a crucial part of achieving effective retrieval. The query "star wars" can
be interpreted very differently in the context of missile defense systems than
in the context of Hollywood films. Many approaches have been tried to identify and use
context, mostly in the form of query expansion techniques. For example, the Local
Context Analysis technique [23] identifies words and phrases associated with the
query context by analyzing retrieved documents. This technique, although one
of the most successful in terms of improving retrieval effectiveness, is ad hoc and
cannot distinguish multiple contexts for a given query. The language model
approach appears to provide a more principled way of describing and using context
that will lead to substantially more effective retrieval.

Language  models  for  important  topics  could  be  based  on  groups  of  similar 
documents.  We  call  these  topic  models  to  distinguish  them  from  models  based 
on  individual  documents.  To  generate  topic  models  for  a  set  of  documents,  the 
documents  would  first  need  to  be  clustered  or  grouped,  and  then  a  model  could 
be  estimated  for  each  group.  Note  that  this  represents  a  different  form  of  the 
clustering  hypothesis  [17],  which  states  that  closely  associated  documents  tend  to 
be  relevant  to  the  same  requests.  Instead,  we  are  assuming  that  closely  associated 
documents  will  have  the  same  underlying  language  model.  Xu  and  Croft  [22] 
used  this  technique  to  represent  databases  using  multiple  language  models  for 
distributed  search. 
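
The two-step procedure described above (group the documents, then estimate one model per group) might be sketched as follows. The cluster assignment is taken as given; the clustering algorithm that produces it is outside this sketch, and the function name, labels, and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def topic_models(docs, cluster_of):
    """Estimate one maximum-likelihood unigram model per cluster.

    docs:       dict mapping document id -> list of tokens
    cluster_of: dict mapping document id -> cluster label (assumed given)
    Returns a dict mapping cluster label -> {word: P(word | topic)}.
    """
    # Pool the term counts of all documents that share a cluster label.
    pooled = defaultdict(Counter)
    for doc_id, tokens in docs.items():
        pooled[cluster_of[doc_id]].update(tokens)

    # Normalize the pooled counts into a probability distribution.
    models = {}
    for label, counts in pooled.items():
        total = sum(counts.values())
        models[label] = {w: c / total for w, c in counts.items()}
    return models


docs = {
    "d1": ["star", "wars", "film"],
    "d2": ["film", "hollywood"],
    "d3": ["missile", "defense", "star", "wars"],
}
cluster_of = {"d1": "movies", "d2": "movies", "d3": "defense"}
models = topic_models(docs, cluster_of)
print(models["movies"])
```

These are unsmoothed estimates; words unseen in a cluster get no entry (zero probability), so a practical system would smooth them before use.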

A variation of this approach would be to cluster document passages and
allow multiple topic models to be associated with a given document. In a very
similar approach, Hoffman [9] describes how mixture models based on latent
classes can represent documents and queries. The latent classes are generated
using clustering based on the EM (Expectation-Maximization) algorithm [11],
and Hoffman shows how this approach is related to Latent Semantic Indexing
[6]. The use of mixture models to represent queries and documents makes it clear
that many of the previous uses of expansion and clustering techniques in IR can
be described as smoothing techniques in the language model framework.


Information  Retrieval  Based  on  Statistical  Language  Models 




3  Language  Models  and  Relevance 

The  Ponte  and  Croft  model  uses  a  relatively  simple  definition  of  relevance  that 
is  based  on  the  probability  of  generating  a  query  text.  This  definition  does  not 
easily  describe  some  of  the  more  complex  phenomena  involved  with  information 
retrieval.  The  language  model  approach  can,  however,  be  extended  to  incorporate 
more  general  notions  of  relevance.  For  example,  Berger  and  Lafferty  [1]  show  how 
a  language  modeling  approach  based  on  machine  translation  provides  a  basis  for 
handling  synonymy  and  polysemy.  Related  tasks,  such  as  question  answering 
and  summarization,  also  provide  a  challenge  for  a  model  of  relevance. 

Early probabilistic models of retrieval, such as the binary independence
model, viewed retrieval as a classification problem [17]. Documents were treated
as belonging to either the relevant (R) or non-relevant classes for a particular
query. In this model, documents are ranked according to the probability P(R|D),
or the probability that a particular document D belongs to the relevant class.
Fuhr [7] extended this model to non-binary document representations and made
the conditioning on the query explicit. In his model, documents are ranked by
P(R|D, Q). Turtle and Croft [20] proposed a Bayesian net model that calculates
P(I|D), which is the probability that a particular information need I is
satisfied for a given document. This is a different way of describing relevance, but is
otherwise quite similar to the previous models. In this model (described in more
detail in the next section), the query is represented as intermediate propositions
that describe the information need.

Ponte [15] views a query as the user's description of an ideal or relevant
document. More specifically, the description is treated as a text sample. The
task of the retrieval system, in his view, is to rank documents by P(Q|M_D),
which is the probability that the query text can be generated by the language
model M_D associated with a given document D. This view of retrieval, however,
does not easily describe the question-answering task and is regarded by some
researchers as an inadequate model of relevance.

Miller et al. [12] described a simple probabilistic model for ranking documents
by P(D is R|Q), which is described as the probability that D is relevant given
the query Q. Using Bayes' rule, this can be transformed into

$$\frac{P(Q \mid D \text{ is } R)\, P(D \text{ is } R)}{P(Q)}$$

This model is somewhat awkward, and although the term P(Q|D is R) is treated
in Miller et al. [12] as being the same as the Ponte probability, it is not, because
of the constraint that the document is relevant. In the absence of relevance
information, it is difficult to apply this model.

Berger and Lafferty's model [1] has similarities to the Ponte model in that
they view the user generating a query as a sample of an ideal document. The
task of the system is then to find the a posteriori most likely documents given
the query and the specific user U. In other words, documents are ranked by

$$P(D \mid Q, U) = \frac{P(Q \mid D, U)\, P(D \mid U)}{P(Q \mid U)}$$
The denominator P(Q|U) is fixed for a given query and user. The term P(D|U) is
a "document prior" that can be used, for example, to discount short documents.
If we assume a uniform prior, this is the same as the Ponte model.
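
The reduction to the Ponte model can be spelled out in a few steps. The derivation below is a sketch: it starts from Bayes' rule for P(D|Q,U) and adds the assumption, not stated explicitly in the text, that query generation depends on the user only through the document, so that P(Q|D,U) = P(Q|D).

```latex
\begin{align*}
P(D \mid Q, U)
  &= \frac{P(Q \mid D, U)\, P(D \mid U)}{P(Q \mid U)} \\
  &\propto P(Q \mid D, U)\, P(D \mid U)
     && \text{($P(Q \mid U)$ does not depend on $D$)} \\
  &\propto P(Q \mid D, U)
     && \text{(uniform prior $P(D \mid U)$)} \\
  &= P(Q \mid D)
     && \text{(assuming $P(Q \mid D, U) = P(Q \mid D)$)}
\end{align*}
```

Since ranking is unaffected by factors that are constant across documents, the resulting order is the one given by the query generation probability alone.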

Another formulation of the retrieval process, which we are currently
investigating, views the query as a sample or a description of an underlying language
model M_Q. This language model describes the information need. In other words,
instead of the user having a "perfect document" in mind, this approach assumes
that the user has some idea of the characteristics of good documents and can
describe these characteristics in terms of relative frequencies, co-occurrences, and
other phenomena that can be captured in a language model. The task of the
retrieval system is viewed as first to estimate M_Q and then use this model to
retrieve documents (or answers).

In this approach, we estimate M_Q using a mixture of document models. A
"pooled" mixture could be found using the document models that optimize

$$\max P(Q \mid M_{\{T_1 \ldots T_k\}}).$$

This part of the retrieval process is very similar to the Ponte model, but
here the document models are being used to smooth the model of the
information need. Then, for each document in the collection, we compute the posterior
likelihood that the smoothed model M_Q is the source from which D was
generated: P(M_Q|D). Applying Bayes' rule, we rank documents by the equivalent
log (P(D|M_Q) / P(D)) (the prior P(M_Q) does not affect the ranking). This model
is also similar to the Berger and Lafferty approach, but there are important
differences. In particular, the process of forming the language model of the
information need allows the query to be something other than a text sample.
Queries formulated using query languages (such as Boolean operators) or as
questions can be accommodated. More development of this model of relevance
and experimental validation remain to be done.

4  Language  Models  and  Combination  of  Evidence 

Combining  multiple  sources  of  evidence  about  relevance  has  been  shown  many 
times  to  be  an  effective  approach  to  IR. 

The inference network framework, developed by Turtle and Croft [20] and
implemented as the INQUERY system [5], was explicitly designed for combining
multiple representations and retrieval algorithms into an overall estimate of the
probability of relevance. This framework uses a Bayesian network [14] to
represent the propositions and dependencies in the probabilistic model (Figure 2).
The network is divided into two parts: the document network and the query
network. The nodes in the document network represent propositions about the
observation of documents (D nodes), the contents of documents (T nodes), and
representations of the contents (K nodes). Nodes in the query network represent
propositions about the representations of queries (K nodes and Q nodes) and
satisfaction of the information need (I node). This network model corresponds
closely to a framework for combining classifiers as indicated by the labels on the
boxes in Figure 2.


Fig.  2.  Bayesian  net  model  of  information  retrieval 


In this model, all nodes represent propositions that are binary variables with
the values true or false, and the probability of these states for a node is
determined by the states of the parent nodes. For node A, the probability that A is
true is given by:

$$p(A) = \sum_{S \subseteq \{1,\ldots,n\}} \alpha_S \prod_{i \in S} p_i \prod_{i \notin S} (1 - p_i)$$

where α_S is a coefficient associated with a particular subset S of the n parent
nodes having the state true, and p_i is the probability of parent i having the state
true. Some coefficient settings result in very simple but effective combinations of
the evidence from parent nodes. For example, if α_S = 0 unless all parents have
the state true (and α_S = 1 in that case), this corresponds to a Boolean and. In
this case, $p(A) = \prod_{i=1}^{n} p_i$.
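
To make the combination rule concrete, the sketch below evaluates it by brute-force enumeration of every subset S of parents that are true, weighting each coefficient by the product of p_i for parents in S and (1 - p_i) for parents outside S. The function names and example probabilities are hypothetical, and the enumeration is exponential in the number of parents, so this is an illustration rather than a practical implementation.

```python
from itertools import combinations
from math import prod


def combine(parent_probs, alpha):
    """Sum alpha(S) * P(exactly the parents in S are true) over all subsets S.

    parent_probs: list of probabilities p_i that parent i is true
    alpha:        function mapping a frozenset of parent indices (the
                  parents that are true) to a coefficient in [0, 1]
    """
    n = len(parent_probs)
    total = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            s = frozenset(subset)
            # Probability of this particular true/false configuration.
            state_prob = prod(
                p if i in s else 1.0 - p for i, p in enumerate(parent_probs)
            )
            total += alpha(s) * state_prob
    return total


def boolean_and(n):
    """Coefficient setting for AND: 1 only when all n parents are true."""
    return lambda s: 1.0 if len(s) == n else 0.0


probs = [0.9, 0.5, 0.4]
p_and = combine(probs, boolean_and(len(probs)))
print(p_and)  # equals the product of the parent probabilities
```

With the AND coefficients only the full subset contributes, so the sum collapses to the product of the parent probabilities, as stated in the text.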
The most commonly used combination formulas in this framework are the
average  and  the  weighted  average  of  the  parent  probabilities.  These  formulas  are 
the  same  as  those  shown  in  other  research  to  be  the  best  combination  strategies 
for  classifiers  and  discussed  earlier  in  the  paper.  The  combination  formula  based 
on  the  average  of  the  parent  probabilities  comes  from  a  coefficient  setting  where 
the  probability  of  A  being  true  depends  only  on  the  number  of  parent  nodes 
having  the  state  true.  The  weighted  average  comes  from  a  setting  where  the 
probability  of  A  depends  on  the  specific  parents  that  are  true.  Parents  with 
higher  weight  have  more  influence  on  the  state  of  A.  The  INQUERY  search 
system  provides  a  number  of  these  “canonical”  combination  formulas  as  query 
operators.  The  three  described  above  are  #and,  #sum,  and  #wsum. 
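
The closed forms of these three combinations can be checked against the general subset formula. In the sketch below, `op_and`, `op_sum`, and `op_wsum` are hypothetical Python stand-ins for the operators (not INQUERY code), and the coefficient settings |S|/n and (sum of weights of true parents)/(sum of all weights) are one choice, assumed here, that yields the average and the weighted average.

```python
from itertools import combinations
from math import prod


def op_and(probs):
    """Product of the parent probabilities."""
    return prod(probs)


def op_sum(probs):
    """Average of the parent probabilities."""
    return sum(probs) / len(probs)


def op_wsum(probs, weights):
    """Weighted average of the parent probabilities."""
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)


def by_enumeration(probs, alpha):
    """Evaluate the general subset formula directly (exponential in n)."""
    n = len(probs)
    total = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            s = set(subset)
            state_prob = prod(
                p if i in s else 1.0 - p for i, p in enumerate(probs)
            )
            total += alpha(s) * state_prob
    return total


probs = [0.9, 0.5, 0.4]
weights = [3.0, 1.0, 1.0]


def avg_alpha(s):
    # Coefficient depends only on how many parents are true.
    return len(s) / len(probs)


def wavg_alpha(s):
    # Coefficient depends on which specific parents are true.
    return sum(weights[i] for i in s) / sum(weights)


print(op_sum(probs), by_enumeration(probs, avg_alpha))
print(op_wsum(probs, weights), by_enumeration(probs, wavg_alpha))
```

The agreement follows from linearity of expectation: each parent contributes its coefficient share exactly when it is true, which happens with probability p_i.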

In the INQUERY system, different document representations are combined
by constructing nodes corresponding to propositions about each representation
(i.e., whether this document is represented by a particular term from a
representation vocabulary) and constructing queries using those representation nodes.
The queries for each representation are combined using operators such as #wsum.

In the inference net model, the probabilities associated with the query node
propositions are computed from the probabilities associated with representation
nodes. The probabilities associated with representation nodes, however, can be
computed from evidence in the raw data of the documents. For example, a
tf.idf formula is used in INQUERY to compute the probability of a word-based
representation node for a particular document. The use of heuristic estimation
formulas, the lack of knowledge of prior probabilities, and the lack of training
data mean that the outputs of the inference net (document scores) do not
correspond closely to real probabilities.

The language model framework described previously can readily incorporate
new representations, can produce accurate probability estimates, and can be
incorporated into the general Bayesian net framework. Miller et al. [12] point
out that estimating the probability of query generation involves a mixture model
that combines a variety of word generation mechanisms. They describe this
combination using a Hidden Markov Model with states that represent a unigram
language model (P(w|D)), a bigram language model (P(w_n|w_{n-1}, D)), and a
model of general English (P(w|English)), and mention other generation
processes such as a synonym model and a topic model. Hoffman [9], and Berger and
Lafferty [1] also describe the generation process using mixture models, but with
different approaches to representation. Put simply, incorporating a new
representation into the language model approach to retrieval involves estimating the
language model (probability distribution) for the features of that representation
and incorporating that new model into the overall mixture model. The
standard technique for calculating the parameters of the mixture model is the EM
algorithm. This algorithm can be applied to training data that is pooled across
queries and this, together with techniques for smoothing the maximum
likelihood estimates, results in more accurate probability estimates than a system
using tf.idf weights without training, such as INQUERY.
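
As an illustration of how EM can fit mixture parameters, the sketch below estimates a single weight `lam` in a two-component mixture, P(w) = lam * P(w|D) + (1 - lam) * P(w|English), from observations pooled across training queries. It is a deliberately reduced example, not the Hidden Markov Model of Miller et al. [12]; the function names and toy probabilities are hypothetical, and every background probability is assumed to be strictly positive.

```python
import math


def log_likelihood(observations, lam):
    """Log-likelihood of the observed words under mixture weight lam."""
    return sum(
        math.log(lam * p_doc + (1.0 - lam) * p_bg)
        for p_doc, p_bg in observations
    )


def em_mixture_weight(observations, lam=0.5, iterations=50):
    """Estimate lam by EM.

    observations: list of (p_doc, p_bg) pairs, one per observed query
                  word, pooled across training queries; p_doc is the
                  document-model probability of the word and p_bg > 0
                  its background ("general English") probability.
    """
    for _ in range(iterations):
        # E-step: posterior probability that the document model
        # (rather than the background model) generated each word.
        resp = [
            lam * p_doc / (lam * p_doc + (1.0 - lam) * p_bg)
            for p_doc, p_bg in observations
        ]
        # M-step: the new weight is the average responsibility.
        lam = sum(resp) / len(resp)
    return lam


# Toy training data (made up): (P(w|D), P(w|English)) per query word.
observations = [(0.2, 0.01), (0.1, 0.02), (0.0, 0.05), (0.3, 0.01)]
lam_hat = em_mixture_weight(observations)
print(lam_hat)
```

Each EM iteration cannot decrease the likelihood of the training data, so the fitted weight explains the pooled observations at least as well as the starting value.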
There is a strong relationship between the Ponte approach to language
modeling for IR and the inference net. Figure 3 shows the unigram language model
approach represented using a simplified part of the network from Figure 2. The
W nodes that represent the generation of words by the document language model
replace the K nodes representing index terms describing the content of a
document. The Q node represents the satisfaction of a particular query. In other
words, the inference net computes the value of P(Q is true). In the Ponte and
Croft model, the query is simply a list of words. In that model, Q is true when
the parent nodes representing words present in the query are true and the words
not in the query are false. The document language model gives the probabilities
of the true and false states for the W nodes.

As we mentioned in the last section, however, we can regard the query as
having an underlying language model, similar to documents. This language model
is associated with the information need of the searcher and can be described
by P(W_1, ..., W_n|Q). This probability is directly related (by Bayes' rule) to the
probability P(Q|W_1, ..., W_n) that is computed by the inference network.


Fig.  3.  The  language  model  approach  represented  in  a  Bayesian  net 


The  inference  network,  therefore,  provides  a  mechanism  for  comparing  the 
document  language  model  to  the  initial  specification  of  the  searcher’s  language 
model,  which  is  the  first  part  of  the  retrieval  process  described  in  the  last  section. 

5  Conclusion 

Research  on  language  model-based  information  retrieval  systems  is  beginning  to 
bear fruit. Experiments indicate that these systems will be more flexible and
effective than systems based on ad-hoc approaches or other probabilistic models.
There have also been some early promising results in related areas such as
summarization [21]. This, combined with the established track record of the
language model approach for tasks such as speech recognition, provides substantial
encouragement for further study of a language model framework for integrating
language technologies. The increasing interest in applications such as question
answering, cross-language retrieval, and information mining gives additional
impetus to the development of this framework.

Abstract. The anticipated dynamics of the future battlefield will require greatly
increased  mobility,  information  flow,  information  assimilation,  and 
responsiveness  from  a  tactical  operation  center  (TOC)  and  platforms  (tanks, 
armored  personnel  carriers,  etc.).  Three  significant  and  related  trends  in  the 
evolution  of  the  tactical  battlefield  address  these  requirements.  The  first  is  the 
increased  automation  of  the  brigade  nerve  center  or  TOC.  Much  of  this 
automation  will  be  provided  by  software  agent  technology.  The  second  is  the 
digitization  of  current  battlefield  platforms.  This  digitization  greatly  reduces 
the  uncertainty  concerning  these  platforms  and  enables  automated  information 
exchange  between  these  platforms  and  their  TOC.  The  third  is  the  rapid 
development  of  robotics  or  physical  agents  for  numerous  battlefield  tasks  such 
as  clearing  buildings  of  hazards  (such  as  snipers)  or  performing  wingman 
functions  for  a  future  combat  vehicle.  This  paper  illustrates  the  potential 
synergy  between  these  seemingly  disparate  developments,  particularly  related 
to  battlefield  visualization,  multi-resolution  analysis,  software  agents,  and 
physical  agents.  Battlefield  visualization  programs  are  currently  focussed  on 
representing  the  physical  environment.  This  greatly  contributes  to  situation 
awareness  at  the  TOC  and  platform  levels.  As  intelligent  agents,  both  software 
and  physical  agents,  are  developed,  battlefield  visualization  must  be  enhanced 
to include the state, behavior, and results of the actions of these agents.
Multi-resolution data and analysis will enhance visualization, software agent and
physical agent performance.


Introduction 

There is widespread dissatisfaction with the design and functionality of current Army
tactical operation centers (TOCs) [4], due primarily to their lack of mobility,
inefficiency, and high complexity. The extensive hardware, software, and manpower
resources needed to operate a current TOC severely limit the mobility needed
for a future nonlinear, dynamic battlefield. A greatly increased level of automation is
needed  both  to  significantly  lower  the  human  resources  required  and  to  improve 
information  flow.  Figure  1  depicts  the  size  and  mobility  envisioned  for  a  future 
TOC. 

The  TOC  exists  to  support  the  tactical  commander  in  understanding  the  current 
state  of  the  battlefield  and  in  predicting  its  future  state.  It  also  provides  planning, 
monitoring,  and  reaction  functions  to  the  commander.  The  situation  awareness  that 
results  enables  rapid  and  effective  decision  making  and  leadership.  Although  the 
TOC is the information and control center of the tactical battlefield, it must also be
able to project its critical information to a commander on a remote platform such as a
tank  or  helicopter,  observing  or  interacting  with  vital  positions  on  the  battlefield. 
Because  the  TOC  is  an  information  integration  and  fusion  node,  it  is  an  essential  part 
of  a  highly  distributed  and  mobile  force.  A  scalable,  extensible,  and  adaptable 
visualization  and  software  agent  architecture  and  rich  application  set  are  required  to 
achieve  the  increased  efficiency  envisioned.  Most  low-level  information  retrieval, 
dissemination,  and  analysis  will  be  performed  or  controlled  by  these  agents. 


[Figure 1 labels: Reduced complexity; High mobility; Modeling and simulation]


Fig.  1.  Mobile  future  TOC  concept. 

Battlefield  visualization  technology  and  software  agent  technology  are  closely 
linked  because  of  the  need  to  visualize  and  interact  with  both  the  agents  and  the 
results  of  their  analysis.  Automated  communications  between  the  TOC  and  its 
associated  platforms  (human  or  robotic)  will  be  agent  based.  The  digitization  of  the 
lower  echelons  of  the  army  strongly  enhances  the  coupling  of  the  TOC  and  the 
tactical  platforms,  enabling  the  automated  exchange  of  data  and  information,  as  well 
as  access  to  more  advanced  applications  by  means  of  an  agent  environment.  This 
automated  information  exchange  will  greatly  reduce  the  latency  of  information, 
reduce  uncertainty,  and  enable  a  more  real-time  control  system  approach  in  the 
battlefield.  Figure  2  illustrates  this  exchange,  where  agents  are  classified  according  to 
their  battlefield  functional  area. 

Physical  agents  are  expected  to  be  ubiquitous  on  the  future  battlefield,  significantly 
lowering  the  risk  to  our  soldiers.  They  will  be  present  in  a  myriad  of  shapes,  sizes, 
and  capabilities.  Because  these  physical  agents  are  to  complement  future  manned 
systems,  they  must  be  able  to  collaborate  not  only  amongst  themselves  but  also  with 
their  manned  partners.  Their  missions  will  range  from  scout  missions 
(reconnaissance, surveillance, and target acquisition) to urban rescue. Robotic
sentinels and remote communication systems would reduce the soldier workload of a
future TOC. Teams of small robots deployed by manned or unmanned mother ships
will explore (for hazards) and define buildings before manned occupation. Figure 3
depicts an urban scenario [1]. The Army has both cross-country and urban mission
robot programs in development. Robust mobility, collaborative military behavior,
and effective soldier-robot interaction are major development areas. These robots
must be able to operate in these battlefield environments at approximately the same
tempo as the manned forces.




Fig.  2.  TOC-platform  agent  interaction. 

The  information  gathered  by  these  agents  will  be  sent  to  a  mother  ship  or  TOC  and 
be  visualized  by  human  controllers.  The  high-level  control  and  interaction  between 
the  mother  ship  and  its  agents  will  be  based  on  software  agent  technology,  analogous 
to  the  TOC/platform  interaction.  Software  agents  will  be  monitoring  the  robot 
disposition  and  communicating  with  the  robot  controller.  A  future  combat  system 
could  be  augmented  by  these  small  robots,  thereby  increasing  its  urban  effectiveness. 

Software  Agent  Applications 

Figure  4  illustrates  the  relationship  between  software  agent  applications  and 
visualization.  Software  agents  provide  much  of  the  analysis  of  battlefield  data.  Both 
the  results  of  this  analysis  and  the  state  and  behavior  of  these  agents  need  to  be 
visualized.  Of  the  myriad  possible  battlefield  agent  applications,  this  paper  focuses 
on several that require scalability and extensibility of the agent approach.

Fig. 3. Small robot urban scenario

Consider  initially  the  basic  sentinel  application,  where  agents  must  be  able  to 
dynamically  monitor  and  analyze  battlefield  activity  and  perform  alert  functions. 
These  agents  are  assigned  to  monitor  either  fixed  areas  on  the  battlefield  or  areas 
associated with entities (fixed or moving). The following are two examples of
monitor agent scenarios:

1. Assign an agent to monitor a specific area of interest where, if enemy armor is
detected in force before the blue force occupies the nearby hills, the blue
commander and the maneuvering units must be alerted. This agent, although
fixed spatially, must have spatial and temporal reasoning.

2. Assign an agent to monitor a maneuvering blue force battalion, and alert it if any
enemy radar is capable of detecting it as it follows its planned maneuver path.
This agent has mobility (not fixed to a geographic area) in addition to spatial
reasoning.

Although  these  sentinel  agent  applications  seem  simple,  significant  temporal  and 
spatial  reasoning  is  required  to  minimize  unnecessary  alerts. 
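
The first sentinel scenario can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Report and AreaSentinel classes, the rectangular area of interest, the "in force" threshold, and the single deadline standing in for "before the blue force occupies the nearby hills". It is not the agent design used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """A single (hypothetical) sensor report."""
    time: float
    kind: str      # e.g. "enemy_armor"
    x: float
    y: float
    count: int     # number of vehicles observed


@dataclass
class AreaSentinel:
    """Monitors a fixed rectangular area until a deadline."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    deadline: float      # assumed time at which blue occupies the hills
    min_count: int       # assumed threshold for "in force"
    alerted: bool = False

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def observe(self, report):
        """Return True only when a new alert should be raised."""
        if self.alerted:
            return False  # suppress repeated alerts for the same condition
        spatial = self.contains(report.x, report.y)
        temporal = report.time < self.deadline
        in_force = report.kind == "enemy_armor" and report.count >= self.min_count
        if spatial and temporal and in_force:
            self.alerted = True
            return True
        return False
```

Suppressing repeated alerts once the condition has fired is one simple way to reduce unnecessary alerts; a fielded agent would need far richer spatial and temporal reasoning.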

Now  consider  a  broader  agent  application  scenario.  The  TOC  brigade  commander 
has  selected  a  maneuver  course  of  action  plan  that  calls  for  the  synchronized 
movement,  enemy  engagement, and  logistics  resupply  of  the  brigade.  The  plan  has 
been  disseminated  and  the  maneuver  platforms  have  begun  executing  this  course  of 
action. This plan implementation stimulates significant agent activity both in the
TOC and in the maneuver platforms. A global maneuver monitor agent in the
TOC interacts with the maneuver monitor agents in the platforms. The platform
synchronization monitor agents have the task of alerting the human platform
commander if the maneuver entity cannot execute its maneuver plan. Such an agent
would also alert the TOC maneuver monitor agent of any execution problems. A
TOC intelligence agent continuously monitors and retrieves any pertinent enemy
information that would affect this operation. For example, suppose a radar is detected
near the planned path of one of the maneuver battalions. This intelligence agent
alerts both the TOC maneuver plan agent and the affected platform agents
(maneuver and intelligence). At the TOC, a fire support agent generates an attack
plan to disable this enemy sensor asset. This plan is presented to the TOC
commander and is refused because the commander considers the available fire
support assets insufficient. At the affected platforms, the platform maneuver agent
generates a reactive maneuver plan and, if it is acceptable to the local commander, the plan
is executed. A platform logistics monitor agent keeps track of local resources (fuel,
ammunition,  spare  parts,  etc.)  and  disseminates  this  information  to  the  TOC  logistics 
agent.  The  TOC  logistics  agent  continuously  monitors  the  resupply  plan  that  supports 
this  engagement.  If  the  planned  resupply  points  become  inadequate  because  of 
excessive  engagement  times  or  maneuver,  the  TOC  logistics  agent  redefines  the 
resupply  points. 


[Figure 4 labels: Battlefield visualization: software agent state, physical agent state, terrain/features, communications, weather, entities]


Fig.  4.  Intelligent  agent  battlefield  applications  and  visualization. 


This  example  application  indicates  that  monitoring,  alerting,  dissemination  and 
retrieval  agents  are  needed  for  each  of  the  major  battlefield  functions  (such  as 
maneuver, intelligence, and logistics) at both the TOC and the lead platforms. Many
applications are possible within each functional area, and some of them may differ
between the TOC and the platform even within a single functional area such as maneuver.
Because of the complexities inherent in creating and interacting with a large set of
agents, it is essential that the human/agent interaction be intuitive and not
cumbersome. Since many agent applications will be oriented toward entities or areas
in the battlefield, an effective battlefield visualization approach representing the
agents and their behaviors is essential.

Battlefield  Visualization 

We  introduce  here  a  multi-resolution  approach  to  visualization  as  well  as  analysis. 
Most  of  the  current  emphasis  of  the  Army  battlefield  visualization  program  is  on 
providing  a  global  infrastructure  with  the  ability  to  visualize  the  battlefield 
environment  (terrain,  weather,  entities,  features,  communications,  etc.)  at  whatever 
resolution  is  required  and  available.  This  enables  the  commander  to  have  a  custom 
global  view  of  the  battlefield  as  well  as  a  high-resolution  local  view  to  support  critical 
decisions.  This  same  infrastructure  supports  high-fidelity  local  views  for  the  platform 
commanders  as  well  as  the  ability  to  jump  to  any  other  local  view  in  the  world  (as 
long  as  data  is  available)  to  support  training  or  preparation  for  deployment.  This 
scalability  provides  a  single  visualization  approach  suitable  for  both  TOC  and 
platform  applications,  including  robotic  platforms.  Figure  5  illustrates  a  coupled 
2D/3D  visualization  approach. 

A 2D/3D approach is necessary since soldiers are very familiar with two-dimensional
maps and can maintain their global situation awareness. However, the 2D
representation is not as effective for visualization of high-resolution, complex terrain.
3D  representation  is  excellent  for  high-resolution,  complex  terrain,  but  it  is  very  easy 
to  lose  a  global  perspective  (get  lost)  in  all  the  detail  presented.  Presenting  both  views 
simultaneously  eliminates  many  of  the  problems  inherent  in  a  single-view  approach. 


Fig. 5. Coupled 2D/3D visualization.

Many sources of environmental data are available, albeit with widely varying
resolution and coverage. It is therefore necessary for any visualization system to
work with multiresolution data (elevation and imagery). Software agents will use this
multiresolution data for responsive planning and mission execution. While robots do
not visualize, they must reason about their environment. Although the robotic
platforms will have effective local perception, this multiresolution environmental data
will enable them to create reactive plans (implemented by software agents), similar to
the agent activity in human platforms.

Military  planners  currently  use  digital  terrain  and  elevation  data  along  with  digital 
feature  data  to  plan.  Because  the  currently  available  elevation  data  are  so  coarsely 
sampled  (100  m  or  30  m  post  spacing),  these  planned  routes  may  contain  numerous, 
significant  obstacles.  In  order  to  traverse  these  routes,  the  manned  or  unmanned 
vehicles  must  sense  and  react  to  these  obstacles.  As  the  number  of  reactions 
increases, the time to complete the mission also increases. Fortunately, under the
battlefield visualization umbrella, there are programs that are developing the
technology to both rapidly generate and visualize much higher resolution data (1 m).
This  would  enable  an  operator  to  visualize  the  planned  routes  and  manually  detect 
obstacles.  If  the  planning  and  execution  analysis  could  use  the  high-resolution  data, 
then  many  of  the  obstacles  that  fall  within  the  1  to  100  m  range  could  be  detected  and 
avoided  in  the  plan.  However,  the  cost  for  this  high-resolution  analysis  is  increased 
processing  time,  since  the  route-planning  algorithms  would  be  using  much  more  data. 
A  multiresolution  analysis  would  use  high-resolution  data  only  when  the 
environmental  complexity  required  it.  This  would  greatly  decrease  the  processing 
cost  for  most  areas.  Because  the  cost  for  reactive  planning  is  high,  particularly  in 
robotic  platforms,  significant  mission  savings  (time)  are  expected.  Figure  6  illustrates 
the  need  for  high-resolution  data. 

The  original  plan  developed  with  100  m  elevation  post  spacing  does  not  recognize 
a  significant  obstacle  to  the  planned  maneuver.  With  1  m  data,  the  resultant  plan  does 
not  require  reactive  planning. 
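
One way to picture the multiresolution idea is to flag, on the coarse elevation grid, only those cells whose surroundings are rough enough to justify loading high-resolution data. The sketch below is a hypothetical illustration: the paper does not say how environmental complexity is measured, so the neighbor elevation difference and the threshold used here are assumptions.

```python
def cells_needing_high_resolution(coarse_elevation, threshold_m):
    """Return the (row, col) cells of a coarse grid that look complex.

    A cell is flagged when the elevation difference to its right or lower
    neighbor exceeds threshold_m (an assumed complexity criterion). Only
    flagged cells would be re-analyzed with high-resolution (e.g. 1 m) data.
    """
    rows = len(coarse_elevation)
    cols = len(coarse_elevation[0]) if rows else 0
    flagged = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    diff = abs(coarse_elevation[r][c] - coarse_elevation[nr][nc])
                    if diff > threshold_m:
                        flagged.add((r, c))
                        flagged.add((nr, nc))
    return flagged
```

On mostly flat terrain the flagged set stays small, which is where the processing savings described above would come from.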


Agent/Visualization  Implementation 

The  Army  Research  Laboratory  (ARL)  and  the  University  of  Maryland  (UMD)  have 
recently  integrated  a  software  agent  architecture  with  a  2D/3D  multiresolution 
visualization  research  testbed  [3].  The  University  of  Maryland  has  developed  a 
software  agent  architecture  called  Interactive  Maryland  Platform  for  Agents 
Collaborating  Together  (IMPACT)  [2,5],  and  ARL  has  developed  a  large-scale 
battlefield  visualization  testbed,  the  Combat  Information  Processor  (CIP).  IMPACT 
was  used  to  agentize  the  legacy  client/server-based  CIP  and  provide  the  initial  sentinel 
agent  functionality  described  in  this  paper.  This  functionality  was  added  by 
agentizing  the  CIP  control  measure  and  entity  servers.  Figure  7  represents  the  human 
computer  interface  of  this  agent  application. 

Conclusions 

The  Army  must  take  advantage  of  the  synergy  between  its  visualization,  software 
agent,  and  physical  agent  technology  developments.  Without  a  holistic  approach, 
multiple competing visualization and software agent designs will proliferate. Even
with a single optimal design approach for human/agent interaction, this research and
development  must  address  the  ability  of  the  human  controller  to  assimilate  and  act  on 
the  state  of  the  battlefield  and  direct  his  agents  rapidly  enough  to  satisfy  future 
battlefield  dynamics.  An  effective  physical  and  software  agent  interaction  would  be 
perceived  to  be  non-intrusive  and  would  provide  all  the  necessary  focussed 
information  for  rapid  decision  making.  A  software  agent  application  architecture  may 
be  sufficient  to  perform  many  of  the  manpower  intensive  tasks  at  both  the  TOC  and 
the  individual  platforms.  These  tasks  have  been  categorized  similarly  to  the 
battlefield functional areas. Although myriad applications are possible, spanning a
wide range of complexity, a number of low-level applications can also be
very effective in TOC automation. It is critical that the agent approach be scalable,
extensible, and adaptable to address the broad application area of the tactical
battlefield. Many of these tasks can be implemented with generic low-level monitor,
alert, retrieve, and disseminate functions.


[Figure 6 panels: Multi-Resolution Planning & Execution; Low Resolution Planning; High Resolution Planning & Execution]


Fig.  6.  Multi-resolution  planning 

There  still  is  concern  that  the  human/agent  interaction  may  be  too  encumbering  for 
the  commanders  and  staff  involved.  Closely  coupling  the  agent  interaction  with 
battlefield visualization should make the interaction more intuitive. Also, an
embedded training application for decision making that uses this agent approach
will accelerate the acceptance of this approach. This embedded training would
include the ability to rapidly construct scenarios to continuously improve the
commander’s and staff’s decision making. If this training capability is embedded, the
operators will automatically train on the use of this agent approach and develop
trust in these agents.

Fig. 7. Sentinel visualization interface.
Abstract. The field of evolutionary computation has drawn inspiration from
Darwinian evolution in which species adapt to the environment through random
variations and selection of the fittest. This type of evolutionary computation has
found wide applications, but suffers from low efficiency. A recently proposed
non-Darwinian form, called Learnable Evolution Model or LEM, applies a learning
process to guide evolutionary processes. Instead of random mutations and
recombinations, LEM performs hypothesis formation and instantiation. Experiments
have shown that LEM may speed up an evolution process by two or more orders of
magnitude over Darwinian-type algorithms in terms of the number of births (or
fitness  evaluations).  The  price  is  a  higher  complexity  of  hypothesis  formation  and 
instantiation  over  mutation  and  recombination  operators.  LEM  appears  to  be 
particularly  advantageous  in  problem  domains  in  which  fitness  evaluation  is  costly 
or  time-consuming,  such  as  evolutionary  design,  complex  optimization  problems, 
fluid  dynamics,  evolvable  hardware,  drug  design,  and  others. 


1  Introduction 

In  his  prodigious  treatise  “On  the  Origin  of  Species  by  Means  of  Natural  Selection,” 
Darwin  conceived  the  idea  that  the  evolution  of  species  is  governed  by  “one  general 
law,  leading  to  the  advancement  of  all  organic  beings,  namely,  multiply,  vary,  let  the 
strongest  live  and  the  weakest  die”  (Darwin,  1859).  In  such  biological  or  natural 
evolution,  new  organisms  are  created  via  asexual  reproduction  with  variation 
(mutation)  or  via  sexual  reproduction  (recombination).  The  underlying  assumption  is 
that  the  evolution  process  is  not  guided  by  some  “external  mind,”  but  proceeds  through 
semi-random  modifications  of  genotypes  through  mutation  and  recombination,  and 
progresses to more advanced forms due to the principle of the “survival of the fittest.”
In Darwinian evolution, individuals thus serve as holders and transmitters of their
genetic material. Their life experiences play no role in shaping their offspring’s
properties. Jean-Baptiste Lamarck’s1 idea that traits learned during the lifetime of an
individual could be directly transmitted to progeny has been rejected as not biologically
viable because it is difficult to construe a mechanism through which this could occur.2
Many  scientists  believe,  however,  that  there  is  another  mechanism  through  which 
learned  traits  might  influence  evolution,  namely,  the  so-called  Baldwin  effect 
(Baldwin,  1896).  This  effect  stems  from  the  fact  that,  due  to  learning,  certain 
individuals  can  survive  even  though  their  genetic  material  may  be  suboptimal.  In  this 
way,  some  traits  that  otherwise  would  not  survive  are  passed  on  to  the  next 
generations.  Some  researchers  argue  that  under  certain  conditions  such  learning  may 
actually  slow  genetic  change  and  thus  slow  the  progress  of  evolution  (Anderson, 
1997). 

More  than  a  century  after  Darwin  introduced  his  theory  of  evolution,  computer 
scientists  adopted  it  as  a  model  for  implementing  evolutionary  computation  (e.g., 
Holland, 1975; Goldberg, 1989; Michalewicz, 1996; Koza et al., 1999). Their efforts
have  led  to  the  development  of  several  major  approaches,  such  as  genetic  algorithms, 
evolutionary  strategy,  genetic  programming,  and  evolutionary  programs.  These  and 
related  approaches,  viewed  jointly,  constitute  the  rapidly  growing  field  of  evolutionary 
computation  (see,  e.g.,  Baeck,  Fogel,  M.  Mitchell;  1996,  Banzhaf  et  al.,  1999;  Zalzala, 
2000). 

Methods  of  evolutionary  computation  based  on  principles  of  Darwinian  evolution 
use  various  forms  of  mutation  and/or  recombination  as  variation  operators.  These 
operators  are  easy  to  implement  and  can  be  applied  without  any  knowledge  of  the 
problem  area.  Therefore,  Darwinian-type  evolutionary  computation  has  found  a  very 
wide  range  of  applications,  including  many  kinds  of  optimization  and  search  problems, 
automatic  programming,  engineering  design,  game  playing,  machine  learning, 
evolvable  hardware,  and  many  others. 

The  Darwinian-type  evolutionary  computation  is,  however,  semi-blind:  the 
mutation  is  a  random,  typically  small,  modification  of  a  current  solution;  the  crossover 
is  a  semi-random  recombination  of  two  or  more  solutions;  and  selection  is  a  sort  of 
parallel  hill  climbing.  In  this  type  of  evolution,  the  generation  of  new  individuals  is  not 
guided  by  principles  learned  from  past  generations,  but  is  a  form  of  the  trial  and  error 
process  executed  in  parallel.  Consequently,  computational  processes  based  on 
Darwinian evolution tend not to be very efficient. Low efficiency has been the major
obstacle in applying Darwinian-type evolutionary computation to highly complex
problems. The objective of many research efforts in this area has thus been to increase
the efficiency of the evolutionary process.

1 Jean-Baptiste Lamarck, a French naturalist (1744-1829), proposed a theory that the
experience of an individual can be encoded in some way and passed to the genome of
the offspring.

2 Recent studies show that Lamarckian evolution appears to apply in the case of antibody
genes (Steel and Blanden, 2000).

In  modeling  computational  processes  after  principles  of  biological  evolution,  the 
field  of  evolutionary  computation  has  followed  a  long-practiced  tradition  of  looking  to 
nature when seeking technological solutions. The imitation of bird flight by the
mythological Icarus and Daedalus is an early example of such efforts. In seeking
technological  solutions,  the  “imitate-the-nature”  approach,  however,  frequently  does 
not  lead  to  the  best  engineering  results.  Modern  examples  of  successful  solutions  that 
are  not  imitations  of  nature  include  balloons,  automobiles,  airplanes,  television, 
electronic  calculators,  computers,  etc. 

This  paper  discusses  a  recently  proposed,  non-Darwinian  form  of  evolutionary 
computation,  called  Learnable  Evolution  Model  or  LEM.  In  LEM,  new  individuals  are 
created by hypothesis formation and instantiation, rather than through mutation or
recombination.  This  form  of  evolutionary  computation  attempts  to  model  “intellectual 
evolution” — the  evolution  of  ideas,  technical  solutions,  human  organizations,  artifacts, 
etc. — rather  than  biological  evolution.  In  contrast  to  Darwinian  evolution,  an 
intellectual  evolution  is  guided  by  an  “intelligent  mind,”  that  is,  by  humans  who 
analyze advantages and disadvantages of the previous generation of solutions and use the
developed understanding in creating the next generation of solutions. It is due to
intellectual evolution that the process of evolving the automobile, airplane or
computer from primitive prototypes to modern forms was astonishingly rapid, taking
just a few human generations.

The  idea  and  the  first  version  of  the  LEM  methodology  were  introduced  in 
(Michalski,  1998).  A  more  advanced  and  comprehensive  version  is  in  (Michalski, 
2000).  Its  early  implementation,  LEM1,  produced  very  encouraging  results  on  selected 
function  optimization  problems  (Michalski  and  Zhang,  1999).  Subsequent  experiments 
with  a  more  advanced  implementation,  LEM2,  confirmed  earlier  results  and  added 
new highly encouraging ones (e.g., Cervone et al., 2000; Cervone, Kaufman and
Michalski,  2000). 

The  following  sections  briefly  describe  LEM  and  its  relationship  to  Darwinian-type 
evolutionary  computation,  and  then  summarize  results  of  testing  experiments. 


2  LEM  vs.  Darwinian  Evolutionary  Computation 

Darwinian-type  evolutionary  algorithms  can  be  generally  viewed  as  stochastic 
techniques  for  performing  parallel  searches  in  a  space  of  possible  solutions.  They 
simulate  natural  evolution  by  creating  and  evolving  a  population  of  individuals  until  a 
termination  condition  is  met.  Each  individual  in  the  population  represents  a  potential 
solution  to  a  problem.  Such  a  solution  can  be  represented  as  a  vector  of  parameters,  an 
instantiation  of  function  arguments,  an  engineering  design,  a  concept  description,  a 
control strategy, a pattern, a computer program, etc. A precondition for applying an
evolutionary algorithm is the availability of a method for evaluating the quality
(fitness)  of  individuals  from  the  viewpoint  of  the  given  goal. 


A  general  schema  of  an  evolutionary  computation  consists  of  the  following  steps: 

1.  Initialization 

t  :=  0 

Create  an  initial  population  P(t)  and  evaluate  fitness  of  its  individuals. 

2.  Selection 

t  :=  t+1 

Select a new population from the current one based on their fitness: P(t) :=
Select(P(t-1))

3.  Modification 

Apply  change  operators  to  generate  new  individuals:  P(t)  :=  Modify(P(t)) 

4.  Evaluation 

Evaluate  fitness  of  individuals  in  P(t) 

5.  Termination 

If  P(t)  satisfies  the  termination  condition,  then  END,  otherwise  go  to  step  2. 
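
The five steps above map directly onto a short loop. The following Python sketch is illustrative only: the function evolve and the toy components (a one-dimensional real-valued individual, binary tournament selection, Gaussian mutation, a fixed generation limit) are assumptions chosen to make the schema runnable, not part of any particular algorithm discussed in this paper.

```python
import random


def evolve(create, fitness, select, modify, terminated, pop_size=20, seed=0):
    """Run the general schema: initialize, then select/modify/evaluate."""
    rng = random.Random(seed)
    t = 0                                                   # step 1
    population = [create(rng) for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    while True:
        t += 1                                              # step 2
        population = select(population, scores, rng)
        population = modify(population, rng)                # step 3
        scores = [fitness(ind) for ind in population]       # step 4
        if terminated(t, scores):                           # step 5
            return population, scores, t


# Toy components (assumptions for illustration): maximize -(x - 3)^2.
def create(rng):
    return rng.uniform(-10.0, 10.0)


def fitness(x):
    return -(x - 3.0) ** 2


def select(population, scores, rng):
    """Binary tournament: keep the fitter of two random individuals."""
    chosen = []
    for _ in population:
        i, j = rng.sample(range(len(population)), 2)
        chosen.append(population[i] if scores[i] >= scores[j] else population[j])
    return chosen


def modify(population, rng):
    """Small Gaussian perturbation of every individual."""
    return [x + rng.gauss(0.0, 0.1) for x in population]


def terminated(t, scores):
    return t >= 50
```

Swapping in different create, select, modify, or terminated functions yields different members of the family of algorithms that the schema covers.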

Different  evolutionary  algorithms  differ  in  the  way  individuals  are  represented, 
created,  evaluated,  selected  and  modified.  They  may  also  use  different  orders  of  steps 
in  the  above  schema,  employ  single  or  multiple  criteria  in  fitness  evaluation,  assume 
different termination conditions, and simultaneously evolve more than one population.
Some  algorithms  (specifically,  genetic  algorithms)  make  a  distinction  between  the 
search  space  and  the  solution  space.  The  search  space  is  a  space  of  encoded  solutions 
(“genotypes”),  and  the  solution  space  is  the  space  of  actual  solutions  (“phenotypes”). 
Encoded  solutions  have  to  be  mapped  onto  the  actual  solutions  before  the  solution 
quality  or  fitness  is  evaluated. 

As  mentioned  earlier,  in  Darwinian-type  (henceforth,  also  called  conventional) 
evolutionary  algorithms,  change  operators  are  typically  some  forms  of  mutation  and/or 
recombination.  Mutation  is  a  unary  transformation  operator  that  creates  new 
individuals  by  modifying  previous  individuals.  Recombination  is  an  n-ary  operator 
(where  n  is  typically  2)  that  creates  new  individuals  by  combining  parts  of  n 
individuals.  Both  operators  are  typically  semi-random,  in  the  sense  that  they  make 
random  modifications  within  certain  constraints. 

The  selection  operator  selects  individuals  for  the  next  population.  Typical  selection 
methods  include  proportional  selection  (the  probability  of  selecting  an  individual  is 
proportional  to  its  fitness),  tournament  selection  (two  or  more  individuals  compete  for 
being  selected  on  the  basis  of  their  fitness),  and  ranking  selection  (individuals  are 
sorted  according  to  their  fitness  and  selected  according  to  probabilities  associated  with 
different  ranks  on  the  sorted  list).  The  termination  condition  evaluates  the  progress  of 
the  evolutionary  process  and  decides  whether  to  continue  it  or  not. 
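
The three selection methods named above can be sketched as follows. The function names and the linear rank weights are assumptions chosen for illustration; proportional selection as written presumes non-negative fitness values with a positive sum.

```python
import random

def proportional_selection(population, scores, k, rng):
    # Probability of picking an individual is proportional to its fitness.
    return rng.choices(population, weights=scores, k=k)

def tournament_selection(population, scores, k, rng, size=2):
    # `size` distinct individuals compete; the fittest of them is selected.
    chosen = []
    for _ in range(k):
        contenders = rng.sample(range(len(population)), size)
        winner = max(contenders, key=lambda i: scores[i])
        chosen.append(population[winner])
    return chosen

def ranking_selection(population, scores, k, rng):
    # Sort by fitness and select with probabilities tied to rank, not raw fitness.
    order = sorted(range(len(population)), key=lambda i: scores[i])  # worst first
    weights = [0] * len(population)
    for rank, i in enumerate(order, start=1):
        weights[i] = rank            # linear ranking: worst gets 1, best gets n
    return rng.choices(population, weights=weights, k=k)
```

Ranking selection is less sensitive than proportional selection to a single individual with a very large fitness value, because only the order of the scores matters.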




The Learnable Evolution Model, briefly, LEM, also follows this general schema. Its
fundamental difference from Darwinian-type algorithms lies in step 3, as it generates
new individuals in a very different way. In contrast to semi-random change operators
employed  in  Darwinian-type  algorithms,  LEM  conducts  a  reasoning  process  in 
generating  new  individuals.  Specifically,  it  applies  operators  of  hypothesis  formation 
and  hypothesis  instantiation. 

The  operator  of  hypothesis  formation  selects  from  a  population  a  group  of  high- 
performing  individuals,  called  the  H-group,  and  a  group  of  low-performing 
individuals,  called  the  L-group,  according  to  their  fitness.  The  H-group  and  L-group 
may  be  selected  from  the  current  population  or  from  a  sequence  of  past  populations. 
These  groups  can  be  selected  using  a  population-based  method,  a  fitness-based 
method,  or  a  combination  of  the  two.  The  population-based  method  applies  High  and 
Low Population Thresholds (HPT and LPT) in selecting individuals, and the fitness-based
method  applies  High  and  Low  Fitness  Thresholds  (HFT  and  LFT).  The  thresholds  can 
be  fixed  or  may  change  in  the  process  of  evolution.  For  details,  see  (Michalski,  2000). 
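
A possible reading of the two group-selection methods is sketched below. The exact semantics of the thresholds are assumptions made for this example: HPT and LPT are taken as fractions of the population, and HFT and LFT as fractions of the fitness range observed in the population. The cited report should be consulted for the definitions actually used in LEM.

```python
def population_based_groups(population, scores, hpt=0.3, lpt=0.3):
    """H-group: the top `hpt` share of the population; L-group: the bottom `lpt` share."""
    order = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    n_high = max(1, round(hpt * len(population)))
    n_low = max(1, round(lpt * len(population)))
    h_group = [population[i] for i in order[:n_high]]
    l_group = [population[i] for i in order[-n_low:]]
    return h_group, l_group

def fitness_based_groups(population, scores, hft=0.25, lft=0.25):
    """H-group: fitness within the top `hft` share of the fitness range;
    L-group: fitness within the bottom `lft` share of that range."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    h_group = [ind for ind, s in zip(population, scores) if s >= hi - hft * span]
    l_group = [ind for ind, s in zip(population, scores) if s <= lo + lft * span]
    return h_group, l_group

population = list("abcdefghij")          # hypothetical individuals
scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical fitness values
```

With the population-based method the group sizes are fixed in advance, whereas with the fitness-based method they depend on how the fitness values are distributed.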

The  H-group  and  L-group  are  then  supplied  to  a  machine  learning  program  that 
generates a general hypothesis distinguishing high performing from low
performing  individuals.  Such  a  hypothesis  can  be  viewed  as  a  theory  explaining  the 
differences  between  the  two  groups.  Alternatively,  it  can  be  viewed  as  a 
characterization  of  the  sub-areas  of  the  search  space  that  are  likely  to  contain  the  top 
performing  individuals  (the  best  solutions).  Once  such  a  hypothesis  has  been 
generated,  the  algorithm  generates  new  individuals  that  satisfy  the  hypothesis. 

In  principle,  any  inductive  learning  method  can  be  used  for  hypothesis  formation. 
The LEM1 and LEM2 implementations of the LEM methodology have used the AQ-type
learning  method  (specifically,  AQ15  and  AQ18,  respectively;  see  Wnek  et  al.,  1995; 
Kaufman  and  Michalski,  2000b).  This  method  appears  to  be  particularly 
advantageous  for  LEM,  because  it  employs  attributional  calculus  as  the  representation 
language  (Michalski,  2000b).  Attributional  calculus  adds  to  the  conventional  logic 
operators new operators, such as internal disjunction, internal conjunction, attribution
relation,  and  the  range  operator,  which  are  particularly  useful  for  characterizing 
groups  of  similar  individuals.  Attributional  calculus  stands  between  propositional 
calculus  and  predicate  calculus  in  terms  of  its  representational  power. 

New  individuals  are  generated  by  a  hypothesis  instantiation  operator  that 
instantiates the given hypothesis in various ways. As a simple illustration, suppose
that  a  hypothesis  was  generated  by  an  AQ-type  learning  program  and  expressed  in  the 
form  of  two  attributional  rules  (these  rules  are  in  a  simplified  form  to  facilitate 
explanation): 


Rule 1: [x = a v c] & [y = 2.3 .. 4] & [z > 5] (sup=80)
Rule 2: [x = b v d v e] & [z = 3.5 .. 6.4] (sup=15)          (1)

where the domains of attributes x, y, and z are: D(x) = {a, b, c, d, e, f}, and D(y) and
D(z)  range  over  real  numbers  between  0  and  10. 

Rules  in  (1)  characterize  two  subareas  of  the  search  space  that  contain  high 
performing  individuals.  The  first  rule  states  that  high  performing  individuals  appear  in 
the  area  in  which  the  variable  x  has  value  a  or  c,  the  variable  y  takes  value  between  2.3 
and  4,  and  the  variable  z  takes  value  greater  than  5.  The  parameter  sup  (support) 
indicates that this rule covers 80 individuals in the H-group. The second rule
describes an alternative set of conditions, namely, that high performing individuals
appear also in the area in which x takes value b or d or e, and z takes values
from the real interval between 3.5 and 6.4. The second rule covers sup=15 individuals
in  the  H-group.  Note  that  Rule  2  does  not  include  variable  y.  This  means  that  this 
variable  was  found  irrelevant  for  differentiating  between  high  and  low  performing 
individuals. 

The  hypothesis  (1)  is  a  generalization  of  the  set  of  individuals  in  the  H-group.  Thus, 
it  may  potentially  cover  many  other,  unobserved  individuals.  The  instantiation 
operator  instantiates  the  hypothesis  in  different  ways,  that  is,  generates  different 
individuals  that  satisfy  conditions  of  the  rules.  For  example,  using  hypothesis  (1),  the 
operator  may  generate  such  individuals  as: 

<a,  2,  6>,  <c,  3.5,  9.1>,  <a,  2.1,  6.4>  (based  on  Rule  1) 

<d,  2,  6>,  <e,  5.5,  4.3>,  <b,  2.2,  4.5>  (based  on  Rule  2)  (2) 

Since  variable  y  is  not  present  in  Rule  2,  any  values  of  y  could  be  selected  from 
D(y)  to  instantiate  this  rule.  In  our  experiments,  variables  not  present  in  the  rule  were 
instantiated  to  values  selected  randomly  from  among  those  that  appeared  in 
individuals  of  the  training  set  (H-group  and  L-group). 
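
The instantiation operator can be sketched for the two rules in (1) as follows. The data structures, the uniform sampling inside each interval, and the `observed` training values are assumptions made for illustration; only the treatment of attributes absent from a rule (drawing from values seen in the training set) follows the description above.

```python
import random
from typing import NamedTuple

class Interval(NamedTuple):
    lo: float
    hi: float
    open_lo: bool = False          # True encodes a strict lower bound such as z > 5

    def contains(self, value):
        above = self.lo < value if self.open_lo else self.lo <= value
        return above and value <= self.hi

# Rules (1); [z > 5] is bounded above by the domain D(z) = [0, 10].
RULE_1 = {"x": {"a", "c"}, "y": Interval(2.3, 4.0), "z": Interval(5.0, 10.0, open_lo=True)}
RULE_2 = {"x": {"b", "d", "e"}, "z": Interval(3.5, 6.4)}
ATTRIBUTES = ("x", "y", "z")

def satisfies(individual, rule):
    for attr, cond in rule.items():
        value = individual[attr]
        ok = cond.contains(value) if isinstance(cond, Interval) else value in cond
        if not ok:
            return False
    return True

def instantiate(rule, observed, rng):
    """Draw one individual satisfying `rule`; `observed` maps each attribute to the
    values seen in the training set (H-group and L-group)."""
    individual = {}
    for attr in ATTRIBUTES:
        cond = rule.get(attr)
        if cond is None:                         # attribute absent from the rule
            individual[attr] = rng.choice(observed[attr])
        elif isinstance(cond, Interval):
            value = rng.uniform(cond.lo, cond.hi)
            while not cond.contains(value):      # re-draw if an open bound was hit
                value = rng.uniform(cond.lo, cond.hi)
            individual[attr] = value
        else:
            individual[attr] = rng.choice(sorted(cond))
    return individual

rng = random.Random(0)
observed = {"x": ["a", "b", "e"], "y": [1.0, 2.5, 7.2], "z": [4.0, 5.5, 9.0]}  # hypothetical
from_rule_1 = [instantiate(RULE_1, observed, rng) for _ in range(5)]
from_rule_2 = [instantiate(RULE_2, observed, rng) for _ in range(5)]
```

Every individual produced this way satisfies the rule it was drawn from, so the new individuals stay inside the sub-areas that the hypothesis marks as promising.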

The  newly  generated  individuals  are  combined  with  the  previous  ones,  and  a  new 
population  is  selected  using  some  selection  method.  Again,  an  H-group  and  L-group 
are  generated  and  operations  of  hypothesis  generation  and  instantiation  are  repeated. 
The  process  continues  until  a  LEM  termination  condition  is  met,  e.g.,  the  (presumably) 
global  or  a  satisfactory  solution  has  been  found. 

The  above-described  process  of  creating  new  individuals  by  operators  of  hypothesis 
formation  (through  inductive  generalization)  and  hypothesis  instantiation  (by 
generating  individuals  satisfying  the  hypothesis)  constitutes  the  Machine  Learning 
Mode  of  LEM.  A  general  form  of  LEM  includes  two  versions:  uniLEM,  which 
repeatedly applies the Machine Learning Mode until a termination condition is
satisfied, and duoLEM, which toggles between the Machine Learning Mode and the Darwinian
Evolution Mode, switching from one mode to the other when the termination condition
for the given mode is satisfied (when there is little progress in executing the mode).
The Darwinian Evolution Mode executes one of the existing conventional
evolutionary  algorithms. 
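
The mode switching in duoLEM can be sketched as a small control loop. The stall criterion used here (no improvement of the best fitness for `patience` consecutive steps) is an assumption standing in for "little progress"; the two step functions are placeholders for a real Machine Learning Mode and a real Darwinian Evolution Mode.

```python
def duo_lem(population, fitness, ml_step, darwin_step,
            max_steps=100, patience=3, min_gain=1e-9):
    """Alternate between two modes, switching whenever the active one stalls."""
    modes = [("machine_learning", ml_step), ("darwinian", darwin_step)]
    current = 0                          # start in Machine Learning Mode
    best = max(fitness(ind) for ind in population)
    stalled = 0
    trace = []                           # which mode ran at each step
    for _ in range(max_steps):
        name, step = modes[current]
        population = step(population)
        trace.append(name)
        new_best = max(fitness(ind) for ind in population)
        if new_best - best > min_gain:
            best, stalled = new_best, 0
        else:
            stalled += 1
        if stalled >= patience:          # little progress: toggle the mode
            current, stalled = 1 - current, 0
    return population, best, trace

# Placeholder modes for illustration: "learning" improves values up to a cap of 5,
# while the "Darwinian" stub leaves the population unchanged.
def ml_step(population):
    return [min(value + 1, 5) for value in population]

def darwin_step(population):
    return list(population)

final_population, best_fitness, trace = duo_lem(
    [0], lambda value: value, ml_step, darwin_step, max_steps=12)
```

Under the same assumptions, uniLEM corresponds to calling only `ml_step` in the loop until its termination condition holds.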

A  comprehensive  explanation  of  various  details  of  the  LEM  methodology  and  its 
variants  is  in  (Michalski,  1999). 

3  A  Simple  Illustration  of  LEM 

To  illustrate  LEM,  let  us  consider  a  very  simple  search  problem  in  a  discrete  space. 
The  search  space  is  spanned  over  four  discrete  variables:  x,  y,  w,  and  z,  with  domains 
{0,1}, {0,1}, {0,1}, and {0,1,2}, respectively. Figure 1A presents this space using the
General Logic Diagram or GLD (Michalski, 1978; Zhang, 1997). Each cell of the
diagram represents one individual. For example, the uppermost cell marked by 7
represents the vector <0, 0, 0, 2>. The initial population is visualized by cells marked
by dark dots (Figure 1A). The numbers next to the dots indicate the fitness value of the
individual.  The  search  goal  is  to  determine  individuals  with  the  highest  fitness, 
represented  by  the  cell  marked  by  an  x  (with  the  fitness  value  of  9). 

Figure 1. The search space and four states of the LEM search process.

We assume that descriptions discriminating between an H-group and an L-group are in
the form of attributional rules learned by AQ-type learning programs. Figure 1B presents
the H-group individuals (the gray-shaded cells) and L-group individuals (crossed
cells) determined from the initial population. The shaded areas in Figure 1C represent
two attributional rules discriminating between the H-group and the L-group: [w = 0]
& [z = 1 v 2] and [y = 1] & [w = 1] & [z = 0 v 1].

Figure 1D shows individuals in the H-group (shaded cells) and the L-group (crossed
cells) generated by instantiating rules in Figure 1C. The shaded area in Figure 1D
represents a rule that discriminates between these groups: [x = 0] & [y = 1] & [w = 0] &
[z = 1 v 2]. This rule was obtained through incremental specialization of the parent rule,
and covers two individuals. The global solution will be located in the next iteration.
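
The sizes of the areas described by these rules can be checked by enumerating the 24 cells of the search space. The sketch below encodes each rule as a mapping from a variable to its allowed values; this encoding and the names are assumptions for illustration.

```python
from itertools import product

DOMAINS = {"x": (0, 1), "y": (0, 1), "w": (0, 1), "z": (0, 1, 2)}

def covered_cells(rule):
    """List every cell <x, y, w, z> of the search space that satisfies `rule`."""
    names = list(DOMAINS)
    cells = []
    for values in product(*(DOMAINS[name] for name in names)):
        cell = dict(zip(names, values))
        if all(cell[attr] in allowed for attr, allowed in rule.items()):
            cells.append(values)
    return cells

rule_c1 = {"w": {0}, "z": {1, 2}}                   # [w = 0] & [z = 1 v 2]
rule_c2 = {"y": {1}, "w": {1}, "z": {0, 1}}         # [y = 1] & [w = 1] & [z = 0 v 1]
rule_d = {"x": {0}, "y": {1}, "w": {0}, "z": {1, 2}}  # the specialized rule of Figure 1D

print(len(covered_cells(rule_c1)), len(covered_cells(rule_c2)), covered_cells(rule_d))
```

The first rule covers 8 cells and the second 4, while the specialized rule covers only the two cells <0, 1, 0, 1> and <0, 1, 0, 2>, both of which lie inside the area of the first rule.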


4  Summary  of  Testing  Experiments 

To  test  the  LEM  methodology,  it  has  been  implemented  in  a  general-purpose  form  in 
programs  LEM1  (Michalski  and  Zhang,  1999)  and  LEM2  (Cervone,  1999).  It  was  also 
employed  in  program  ISHED1,  specifically  tailored  to  problems  of  optimizing  heat 
exchangers  (Kaufman  and  Michalski,  2000a).  Both  LEM1  and  LEM2  were  applied 
to a range of function optimization problems. LEM1 was also successfully applied to
a problem in filter design (Colleti et al., 1999). LEM2 was tested in a wide range of
experiments  dealing  with  optimizing  different  types  of  functions  with  different 
numbers  of  arguments,  ranging  from  4  to  180  continuous  variables. 

In  all  experiments  LEM2  strongly  outperformed  conventional  evolutionary 
computation algorithms employed in the study, frequently achieving speedups of two or
more orders of magnitude in terms of the number of births (or function evaluations).
Results  from  LEM2  were  also  significantly  better  than  the  best  results  from 
conventional  evolutionary  algorithms  published  on  a  website.  These  and  other  recent 
results have been described in (Cervone et al., 2000a; Cervone et al., 2000b). Results
from  experiments  with  ISHED1  were  presented  in  (Kaufman  and  Michalski,  2000a). 
According to the collaborating expert, ISHED1’s heat exchanger designs were
comparable  to  the  best  human  designs  in  the  case  of  uniform  flow  of  refrigerant,  and 
were  superior  to  the  best  human  designs  in  the  case  of  non-uniform  flow. 

5  Conclusion 

Experimental  studies  conducted  so  far  have  strongly  demonstrated  that  the  proposed 
Learnable  Evolution  Model  can  significantly  speed  up  evolutionary  computation 
processes  in  terms  of  the  number  of  births  (or  fitness  evaluations).  These  speed-ups 
have  been  achieved  at  the  cost  of  higher  complexity  of  operators  generating  new 
individuals  (hypothesis  formation  and  instantiation).  An  open  problem  is  thus  to  study 
trade-offs  associated  with  the  LEM  application  to  different  problem  domains.  It  is 
safe  to  say,  however,  that  LEM  is  likely  to  be  highly  advantageous  in  problem  areas  in 
which  computation  of  the  evaluation  function  is  costly  or  time-consuming.  Such  areas 
include engineering design, complex optimization problems, fluid dynamics,
evolvable  hardware,  drug  design  and  automatic  programming. 

Another  limiting  aspect  of  LEM  is  that  in  order  to  apply  it,  the  machine  learning 
system  must  be  able  to  work  with  the  given  representation  of  individuals.  For 
example,  if  individuals  are  represented  as  attribute-value  vectors,  rule  and  decision 
tree  learning  systems  can  be  applied.  If  they  are  represented  as  relational  structures,  a 
structural  learning  system  must  be  employed. 

In conclusion, the open problems for further research on LEM include understanding
the benefits, trade-offs, advantages and disadvantages of LEM versus Darwinian-type
evolutionary algorithms in different problem domains.

Abstract. A key step of supervised learning is testing whether a candidate
hypothesis covers a given example. When learning in first order logic languages,
the covering test is equivalent to a Constraint Satisfaction Problem (CSP). For
critical values of some order parameters, CSPs present a phase transition, that is,
the probability of finding a solution abruptly drops from almost 1 to almost 0,
and the complexity dramatically increases. This paper analyzes the complexity and
feasibility of learning in first order logic languages with respect to the phase
transition of the covering test.


1  Introduction 

This paper is concerned with supervised learning from structured examples, termed
relational learning [Qui90] or Inductive Logic Programming (ILP) [MDR94].
Relational learning involves an additional difficulty, compared to learning in
attribute-value languages: in first order logic, the covering test (testing whether
a candidate hypothesis covers a given example) can be formulated as a Constraint
Satisfaction Problem (CSP), which is an NP-hard task. The phase transition
manifests itself as an abrupt change, with respect to some order parameters of the
problem class, of the probability for a problem instance to be satisfiable. This
change is usually coupled with a peak in computational complexity [HHW96]; the
“hardest-on-average” instances lie in the phase transition, or mushy region.

Previous work has provided strong evidence of the existence of a phase transition
for the relational covering test [GS00]. The mushy region was empirically
localized and found to be relevant to relational learning; this region is very likely
to be visited when learning non-toy relational concepts. Even worse, the phase
transition might have deceptive effects on the learning search; according to
preliminary experiments, the mushy region seems to attract the learning search, no
matter what the region of the “true” target concept is.

In this paper, we focus on the actual effects of the phase transition on relational
learning, through systematic experiments on artificial problems. We define some
hundreds of target concepts, located in the under-constrained, over-constrained
and mushy regions. For any given target concept, we construct a learning set and a
test set. The well-known top-down relational learner FOIL [Qui90]
is run on these training sets, and the theories learnt by FOIL are compared to
the true target concepts. A Failure Region appears, where FOIL fails to either
identify or accurately approximate the target concept. The reasons for FOIL’s
behavior are analyzed and some explanations are given.

2  Phase  Transition  in  Hypothesis  Testing 

We restrict ourselves to the simplest case of concept learning in first order logic,
i.e., learning a 0-ary conjunctive relation. The target concept $\varphi$ is thus described
as a conjunctive formula implicitly existentially quantified. Let $E$ be a universe
where $\varphi$ is evaluated. $E$ is said to be a positive example of $\varphi$ if it contains a
model of $\varphi$, and a negative example otherwise.

Let us briefly recall our previous results. Let any example $E$ be given as a
conjunction of ground literals $\alpha_i(v_{i_1}, \ldots, v_{i_K})$, where $\alpha_i$ is a predicate
symbol and each $v_{i_j}$ denotes a constant of the application domain [MDR94]. The set of
literals built on the same predicate symbol $\alpha_i$ is termed a relation. The complexity
of example $E$ is characterized by two parameters: the number $L$ of distinct constants
and the average size $N$ of the relations occurring in $E$. Similarly, the complexity
of hypothesis $\varphi_t$ is characterized by its number $n$ of variables and its total
number $m$ of literals. The CSP defined as “test whether $\varphi_t$ covers $E$” is finally
characterized by the 4-tuple $(n, N, m, L)$.
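
To make the covering test concrete, the sketch below implements it as a plain backtracking search over binary literals. The data representation (a hypothesis as a list of `(predicate, variable, variable)` triples, an example as a mapping from predicate names to sets of constant pairs) and the absence of any constraint forcing distinct variables to take distinct constants are assumptions made for this illustration, not the implementation used in the experiments.

```python
def covers(hypothesis, example, binding=None):
    """Return True if some substitution of constants for variables makes every
    literal of `hypothesis` a ground literal of `example` (backtracking search)."""
    binding = binding or {}
    if not hypothesis:
        return True
    (pred, a, b), rest = hypothesis[0], hypothesis[1:]
    for ca, cb in example.get(pred, ()):
        if binding.get(a, ca) != ca or binding.get(b, cb) != cb:
            continue                      # clashes with an earlier binding
        if a == b and ca != cb:
            continue                      # same variable needs a single constant
        if covers(rest, example, {**binding, a: ca, b: cb}):
            return True
    return False

def order_parameters(hypothesis, example):
    """Return (n, N, m, L) for one covering test."""
    variables = {v for _, a, b in hypothesis for v in (a, b)}
    constants = {c for tuples in example.values() for pair in tuples for c in pair}
    avg_size = sum(len(tuples) for tuples in example.values()) / len(example)
    return len(variables), avg_size, len(hypothesis), len(constants)

# Hypothetical example with two relations and four constants.
example = {"on": {(1, 2), (2, 3)}, "left": {(3, 4)}}
chain = [("on", "X", "Y"), ("on", "Y", "Z"), ("left", "Z", "W")]
print(covers(chain, example), order_parameters(chain, example))
```

In the worst case the search explores a number of partial bindings that grows exponentially with the number of variables, which is why the cost of the covering test matters for relational learners.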

Extending the work by [Pro96], [BGS99] have shown the occurrence of a phase
transition in the covering test with respect to the order parameters $(m, L)$. Fig.
1(a) plots the probability $P_{cov}$ for a hypothesis to cover an example as a function
of $m$ and $L$; $n$ and $N$ are set to 10 and 100, respectively. The contour plot of
the crossover point ($P_{cov} = 0.5$) for different numbers $n$ of variables is given in
Fig. 1(b); the phase transition shifts toward the upper right as $n$ increases, and
the computational cost (not shown here) increases exponentially.


Fig. 1. (a) $P_{cov}$ in the plane $(m, L)$; $n = 10$, $N = 100$; (b) $P_{cov} = 0.5$ for $n = 4, 6, 10,
14$; $N = 100$.


Let $\varphi$ be a concept and let $m_\varphi$ denote its number of literals; let, moreover,
$L_{\varphi,cr}$ be the critical number of constants for which the pair $(m_\varphi, L_{\varphi,cr})$ falls
within the phase transition region. Any problem of deciding whether $\varphi$ covers an
example $E$ lies on the vertical line $m = m_\varphi$ in the above landscape (Fig. ??).
Depending on the number $L$ of constants in $E$, three possibilities are distinguished.
For $L > L_{\varphi,cr}$, the covering test lies in the over-constrained region (NO region);
assuming that $E$ corresponds to a uniformly generated universe, $E$ is almost surely a
negative example of $\varphi$. Symmetrically, when $L < L_{\varphi,cr}$, the covering test lies in the
under-constrained region (YES region), and $E$ would be a positive example of
$\varphi$. Last, if $L \approx L_{\varphi,cr}$, $E$ might be a positive or negative example with about
equal probability (again, assuming a uniform random generation of $E$). This
implies that two different concepts having $m_\varphi$ literals cannot be distinguished
with respect to their coverage of uniformly generated examples, except if those
examples involve about $L_{\varphi,cr}$ constants.

On the contrary, assuming a non-random example distribution, we may expect
any rate of successful and unsuccessful covering tests in any region of the
plane $(m, L)$. In the following, both random and non-random distributions will
be considered.


3  Experiment  Goal  and  Setting 

Our study is based on artificial learning problems. Each learning problem is
characterized as a triplet $(\varphi, \mathcal{E}_L, \mathcal{E}_T)$, where $\varphi$ denotes the “true” target concept,
and $\mathcal{E}_L$ and $\mathcal{E}_T$ respectively denote the learning and the test sets. Two restrictions
have been imposed: $\varphi$ only contains binary predicates, and all predicates in the
examples are relevant, i.e., they appear in $\varphi$.

A total of 451 problems have been constructed, each characterized
by a 4-tuple $(n, N, m, L)$. The number $m$ of literals in $\varphi$ varies in the interval
$[5, 30]$; the number $L$ of constants in the examples varies in the interval $[11, 40]$.
The number $n$ of distinct variables in $\varphi$ is set to 4; the relation size is set to
$N = 100$ in all examples. In this way, a wide region of the plane $(m, L)$, including
the phase transition, is covered.

The problems have been generated using the random generator described by
[BGS99], which guarantees a uniform distribution of both the target concepts
and the examples. All training and test sets contain 100 positive and 100 negative
examples each. As noted earlier, such an even distribution is quite unlikely when
the pair $(m, L)$ falls outside the mushy region and the examples follow a uniform
distribution. We thus repair the training and test sets: when $\varphi$ belongs to the NO
region, some negative examples are turned into positive ones by adding a model of
$\varphi$; symmetrically, when $\varphi$ belongs to the YES region, some positive examples are
turned into negative ones by removing all models of $\varphi$.

The relational learning goal is to discover either the very description of the
target concept $\varphi$, or some accurate approximation $\hat{\varphi}$ of it. Besides the
computational cost, two issues have been specifically considered:

Predictive accuracy. As usual, the accuracy is given by the percentage of test
examples correctly classified by the hypothesis $\hat{\varphi}$ produced by the learner. The
accuracy is considered satisfactory iff it is greater than 80% (the point of this
threshold value will be discussed later on).

Concept identification. It must be emphasized that a high predictive accuracy
does not imply that the learner has discovered the actual target concept $\varphi$.
The two issues must therefore be distinguished. The identification is considered
satisfactory iff the structure of $\hat{\varphi}$ is close to that of the true target concept $\varphi$,
i.e., if $\hat{\varphi}$ is conjunctive.

Most experiments have been done using FOIL [Qui90], which basically performs
a top-down exploration. It starts with the most general hypothesis, and
iteratively specializes its current hypothesis $\varphi_t$ by taking its conjunction with the
“best” literal $\alpha_i(x_j, x_k)$ according to some statistical criterion (Information Gain
[Qui90] or Minimum Description Length (MDL) [Ris78]). When further specializing
the current hypothesis does not improve the criterion, the current hypothesis $\varphi_T$ is
retained, all positive examples covered by $\varphi_T$ are removed from the training set, and
the search is restarted, unless the training set is empty. The final hypothesis $\hat{\varphi}$ returned
by FOIL is the disjunction of all retained partial hypotheses $\varphi_T$.
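
For readers unfamiliar with the criterion, the sketch below shows the information gain in the form commonly attributed to FOIL [Qui90], with one simplification that is an assumption of this example: coverage is counted over examples rather than over variable bindings. The candidate literals and their counts are hypothetical.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Gain of adding a literal: p0/n0 positives/negatives covered before,
    p1/n1 covered after. Simplification: examples are counted, not bindings."""
    if p1 == 0:
        return 0.0                      # the literal excludes every positive example
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)

def best_literal(p0, n0, candidates):
    """Pick the candidate with the highest gain; `candidates` maps a literal
    name to its (p1, n1) coverage."""
    return max(candidates, key=lambda lit: foil_gain(p0, n0, *candidates[lit]))

# Hypothetical coverage counts after adding each candidate literal.
candidates = {"r1(X,Y)": (80, 20), "r2(X,Z)": (50, 0), "r3(Y,Z)": (50, 50)}
choice = best_literal(100, 100, candidates)
```

The gain rewards literals that keep many positive examples while raising the proportion of positives among the covered examples, which is why a literal that leaves the class balance unchanged receives a gain of zero.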

4  Results 

Predictive Accuracy. Each relational problem $(\varphi, \mathcal{E}_L, \mathcal{E}_T)$ is defined by the
size $m$ of the target concept and the number $L$ of constants in the examples, and
represented as a point in the plane $(m, L)$. Fig. 2 reports the predictive accuracy
of the hypotheses $\hat{\varphi}$ learned with FOIL. A successful case (resp. a failure case)
corresponds to a “+” (resp. “.”).


Fig. 2. Relational learning with FOIL. The Failure region (.) and the Success region
(+). The upper (resp. lower) curve indicates the $(m, L)$ points such that any random
concept $\varphi$ with $m$ literals subsumes with probability .1 (resp. .9) a random example
generated from $L$ constants.

There are marked differences between successful and failure cases: the predictive
accuracy usually is either very high (> 95%) or comparable to that of
random guessing (< 58%) (Table 1). Other experiments made with other learners
suggest that the failure region is almost independent of the success criterion
and the learning strategy. Overall, the experiments suggest that relational
learning succeeds iff either the target concept is sufficiently small ($m < 6$), or
the relational problem is sufficiently far away from the phase transition. The
latter condition was unexpected, as it states that longer concepts (extreme right
region) might be easier to learn than shorter ones (close to the phase transition).
This point will be discussed further in Section 5.

Concept Identification. Table 1 reports the characteristics of the learnt theories
versus the target concept: the first two columns recall the coordinates $m$
and $L$ of the relational learning problem; columns 3 and 4 give the number of
conjunctive hypotheses in $\hat{\varphi}$ and their average number of literals, respectively.
Columns 5, 6 and 7 give the predictive accuracy of $\hat{\varphi}$ on the training and test
sets, and the computational cost of learning (in seconds on a Sparc Enterprise
450). The last column gives the problem category, explained below. Table 1 shows
three categories of relational problems.


Table 1. Target concept $\varphi$ and learnt hypothesis $\hat{\varphi}$

E. Easy problems. FOIL finds a conjunctive hypothesis $\hat{\varphi}$ which equals $\varphi$ or
differs from $\varphi$ by at most one literal, and correctly classifies (almost) all training
and test examples. Easy problems lie in the YES or in the mushy regions, for
low values of $m$.

A. Approximable problems. FOIL finds a conjunctive hypothesis $\hat{\varphi}$ which correctly
classifies (almost) all training and test examples, but largely over-generalizes
$\varphi$ (e.g., $\hat{\varphi}$ has 6 literals instead of 18).

Approximable problems are mostly in the NO region, far away from the phase
transition.

H. Hard problems. FOIL learns a disjunctive hypothesis $\hat{\varphi}$, involving many
conjunctive hypotheses $\varphi_T$ (between 6 and 15) of various sizes, and each $\varphi_T$ only
covers a few training examples. The predictive accuracy of $\hat{\varphi}$ is not much better
than random guess on the test set. In other words, such cases involve the emergence
of a disjunctive hypothesis even though the true concept is a conjunctive one.
Incidentally, the computational cost reaches its maximum for hard problems; this
results from the number of hypotheses learned and from the fact that they lie close
to the phase transition (see next paragraph).

Hard  problems  lie  on  or  close  to  the  phase  transition,  for  high  values  of  m. 

These results confirm the fact that a high predictive accuracy does not imply
that the true concept $\varphi$ has been discovered. It is true that FOIL succeeds
whenever it correctly discovers a single conjunctive concept $\hat{\varphi}$; but $\hat{\varphi}$ might be a
wild generalization of $\varphi$. Obviously, there is no way one can distinguish between
easy and approximable problems in real-world applications.

Location of the hypotheses. Let us examine the hypotheses learnt by FOIL.
Except for those easy problems located in the YES region, conjunctive hypotheses
$\varphi_T$ lie in the mushy region (Fig. 3). More precisely, in the easy problems
located in the mushy region, FOIL discovers the true concept; in approximable
problems, FOIL discovers a generalization of the true concept, lying in the
mushy region; in hard problems, FOIL retains seemingly random disjuncts, most
of them lying in the mushy region. As previously noted [GS00], the phase
transition behaves as an attractor of the learning search.


Fig. 3. Histogram of the conjunctive hypotheses $\varphi_t$.


5  Interpretation 

The above results raise at least three questions. Why does the learning search
end up in the mushy region? When and why is the target concept correctly
identified? When and why should a relational learner fail to approximate the
target concept? Some tentative answers are proposed in this section.


Can  Relational  Learning  Scale  Up? 




5.1  The  Phase  Transition  Is  an  Attractor 

FOIL constructs a series of candidate hypotheses. It starts with a single literal
$\varphi_1$, and specializes $\varphi_t$ to obtain $\varphi_{t+1}$. The series of hypotheses thus necessarily
starts in the YES region, then it might come to visit the mushy region, and
possibly thereafter the NO region. Each $\varphi_t$ is required to be representative,
covering sufficiently many positive examples; the last hypothesis $\varphi_T$ is such that
it is sufficiently correct, covering no or few negative examples. We examine the
implications of this search strategy, depending on the location of the target
concept $\varphi$.

Case 1: $\varphi$ belongs to the phase transition region.

By construction, $\varphi$ would cover a random example with probability around .5;
examples  need  little  repairing  (Section  3)  in  order  to  get  evenly  distributed 
training  and  test  sets.  Hence: 

•  No  hypothesis  in  the  YES  region  can  be  correct  as  it  likely  covers  all  training 
examples.  The  search  must  go  on  until  reaching  the  mushy  region. 

•  Symmetrically,  any  hypothesis  in  the  NO  region  would  hardly  cover  any 
training  example,  hence  it  is  not  representative.  The  search  thus  should  stop  at 
the  very  beginning  of  the  NO  region,  and  preferably  before,  that  is,  in  the  mushy 
region. 

Therefore in this case, a top-down learner is bound to produce hypotheses $\varphi_T$
lying in the mushy region.

Case 2: $\varphi$ belongs to the NO region.

Here,  negative  examples  do  not  need  to  be  repaired;  hence,  any  hypothesis  in  the 
YES  region  will  cover  them;  thus  the  search  must  go  on  at  least  until  reaching 
the  mushy  region.  On  the  other  hand,  any  hypothesis  in  the  NO  region  should 
be  correct,  and  there  is  no  need  to  continue  the  search.  Top-down  learning  is 
thus bound to produce hypotheses φ_T lying in the mushy region, or on the verge
of the NO region.

Case 3: φ belongs to the YES region.

The situation is different here, since there exist correct hypotheses in the YES
region, namely the target concept itself, and possibly many specializations
thereof. Should these hypotheses be discovered (the chances for such a discovery
are discussed in the next subsection), it would not be necessary to continue the
search. In any case, the search should stop before reaching the NO region, for the
following reason: positive examples do not need to be repaired; any hypothesis
in the NO region would cover none of them. Then, top-down learning is bound
to produce hypotheses φ_T in the YES or in the mushy region.

The  above  remarks  explain  why  the  phase  transition  constitutes  an  attractor 
for  top-down  learning. 

5.2  Correct  Identification  of  the  Target  Concept 

The information gain relies on the number of models of a candidate hypothesis.
But any hypothesis in the YES region admits many models in any random
example. The number of models associated with any literal is thus hardly
meaningful, except when the current hypothesis is close to the target concept.
Further, the variance in the number of models blinds the selection of the
literals. Complementary experiments [BGSS00] show that the variance reaches its
maximum as hypotheses reach the phase transition.

Fig. 4. Minimum size of φ_t before the information gain becomes reliable.


Fig. 4 reports, at coordinates (m, L), the minimal level tm of the specialization
process where the information gain becomes reliable. Fig. 4 could be thus
interpreted as a reliability map of the information gain.

Note that, for most problems in the mushy region or on the borderline between
the mushy and the NO regions, tm takes high values, denoting a poor
ability to find any correct path; moving farther away from the phase transition,
tc gradually decreases.


5.3  Good  Approximation  of  the  Target  Concept 

According to the above discussion, relational learning is doomed to fail when
the size m of the target concept and/or the number L of constants in the
application domain is high. Still, when both m and L are high (upper right
region in Fig. 2), FOIL succeeds, and finds highly accurate hypotheses.

This can be explained as follows. Let us assume that the target concept φ
belongs to the NO region.

We show that any generalization ξ of the target concept φ will almost surely
correctly classify any training or test example, provided that ξ belongs to the
NO region: negative examples are randomly constructed; hence, any hypothesis
in the NO region will be correct; in particular, ξ is correct. On the other hand,
any example covered by φ is also covered by ξ; this implies that ξ covers all
positive examples. Finally, any generalization of φ that belongs to the NO region
is complete and (almost surely) correct.

It follows that, if the learning search happens to examine a generalization ξ
of φ which is close to the NO region, ξ will be considered an optimal hypothesis,
which will stop the search. The success of relational learning, with respect to
predictive accuracy, thus depends on the probability of finding a generalization
ξ of φ on the edge of the phase transition.

6  Conclusion 

The present study shed some light on the limitations of several up-to-date
relational learners. One major result of the paper is the fact that the learning
search, be it based on top-down or genetic-like exploration, is trapped in the
mushy region. This result is supported by the systematic experiments reported
here and also by complementary experiments [GS00] on real-world applications.
A second result is the fact that there is a large “blind spot” in the concept
landscape. No concept φ in this area could be learned from examples; all
relational learners considered in the present study failed to learn anything
better than random guessing from the available examples. This blind spot
reflects the criteria used to guide the search, which actually mislead it.

Abstract. INGENS is a prototypical GIS which integrates machine learning
tools  in  order  to  discover  geographic  knowledge  useful  for  the  task  of 
topographic  map  interpretation.  It  embeds  ATRE,  a  novel  learning  system  that 
can  induce  recursive  logic  theories  from  a  set  of  training  examples.  An 
application  to  the  problem  of  recognizing  four  morphological  elements  in 
topographic  maps  of  the  Apulia  region  is  also  illustrated. 


1  Introduction 

Data  stored  in  many  geographical  information  systems  (GIS)  concern  topographic 
maps,  which  show  relief,  vegetation,  hydrography  and  man-made  features  of  a  land 
portion  [4].  Some  map  management  functions  implemented  in  current  GIS  are 
storage,  retrieval  and  visualization  on  different  scales.  Nevertheless,  the  interpretation 
of  topographic  maps  is  an  equally  important  facility  which  is  rarely  supported  in  a 
GIS.  Indeed,  information  given  in  topographic  map  legends  or  in  GIS  models  is  often 
insufficient  to  recognize  geographic  objects  of  interest  for  a  given  application.  For 
example,  a  study  of  the  drawing  instruction  of  Bavarian  cadastral  maps  pointed  out 
that  symbols  for  road,  pavement,  roadside,  garden  and  so  on  were  defined  neither  in 
the  legend  nor  in  the  GIS  model  of  the  map  [8].  These  objects  require  a  process  of 
map  interpretation,  which  can  be  quite  complex  in  some  cases.  The  detection  of 
morphologies  characterizing  the  territory  described  in  a  topographic  map,  the 
selection  of  important  environmental  elements,  both  natural  and  artificial,  and  the 
recognition  of  forms  of  territorial  organization  require  abstraction  processes  and  deep 
domain  knowledge  that  only  human  experts  have.  Although  these  are  the  patterns 
which  geographers,  geologists  and  town  planners  are  interested  in,  they  are  never 
explicitly  represented  in  topographic  maps  or  in  GIS. 

In  order  to  acquire  the  necessary  knowledge  for  map  interpretation,  we  propose  to 
extend  a  GIS  with  a  training  facility  and  a  learning  capability,  so  that  each  time  a  user 
wants  to  query  its  database  on  some  geographic  objects  not  explicitly  modeled,  he/she 
can  prospectively  train  the  system  to  recognize  such  objects  and  to  create  a  special 
user  view.  Both  examples  and  counter-examples  are  provided  by  the  expert  user  by 
means  of  the  GIS  interface.  The  symbolic  representation  of  the  training  examples  is 
automatically  extracted  from  the  maps,  although  it  is  still  controlled  by  the  user  who 
can select a suitable level of abstraction and/or aggregation of data. The learning
module of the information system implements one or more inductive learning
algorithms that can generate models of geographic objects from the chosen
representations of training examples.

Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 40-48, 2000.
© Springer-Verlag Berlin Heidelberg 2000

Discovering Geographic Knowledge: The INGENS System

INGENS (INductive GEographic iNformation System) is a prototypical GIS
devoted to managing topographic maps of the Apulia region (Italy) to support land
planning. Its logical architecture is described in the next section. The distinguishing
feature of INGENS is its inductive learning capability, which is used to discover
geographic knowledge of interest to town planners. In Section 3, the main
characteristics and the high-level algorithm of a novel learning system currently
embedded in INGENS are described. This system, named ATRE, has been applied to
map  interpretation  tasks  to  locate  important  environmental  and  morphological 
concepts  on  topographic  maps.  Section  4  is  devoted  to  the  explanation  of  some 
preliminary  results.  The  paper  concludes  with  a  brief  discussion  on  future  work. 


2  INGENS  software  architecture  and  object  data  model 


The software architecture of INGENS is reported in Figure 1. The Map Repository is
the database instance that contains the actual collection of maps stored in INGENS.
Geographic data are organized according to a hybrid tessellation-topological
object-oriented model (see footnote 1). The tessellation model follows the usual
topographic practice of superimposing a regular grid on a map to simplify the
localization process. Indeed, each map in the repository is divided into square cells
of the same size. The raster image of a cell is stored together with its coordinates
and component objects. In the topological model of each cell it is possible to
distinguish two different hierarchies: physical and logical. The former describes the
geographical objects by means of the most appropriate physical entity, that is point,
line or region, while the latter expresses the semantics of geographic objects
(hydrography, orography, administrative or political boundary, and so on),
independently of their physical representation.

Fig. 1. INGENS three-layered software architecture.

Footnote 1: The object-oriented database management system (OODBMS) used to store
data is ObjectStore 5.0 by Object Design, Inc.

The Map Storage Subsystem is involved in storing, updating and retrieving items to
and from the map repository. As a resource manager, it is the only access path
to the data contained in the repository for multiple, concurrent clients.

The layer of the application enablers makes several functionalities available to the
different users of the system. Users are classified into four categories:

•  Administrators,  who  are  responsible  for  GIS  management. 

•  Map  maintenance  users,  whose  main  task  is  updating  the  Repository. 

•  Sophisticated  end  users,  who  can  train  the  system  to  learn  operational  definitions 
of  geographic  objects  not  explicitly  modeled  in  the  database. 

•  Casual  end  users,  who  occasionally  access  the  database  and  may  need  different 
information  each  time.  Casual  users  cannot  train  INGENS. 

The  Map  Converter  is  a  suite  of  tools  which  support  the  acquisition  of  maps  from 
external  sources,  namely  raster  images  from  scanners  and  geographic  objects  from 
files  of  maps  in  a  proprietary  vector  format.  Currently,  INGENS  can  automatically 
acquire  information  from  vector  maps  in  the  MAP87  format  defined  by  the  Italian 
Military  Geographic  Institute  (IGMI)  (http://www.nettuno.it/fiera/igmi/igmit.htm).  Since 
these  maps  contain  static  information  on  orographic,  hydrographic  and  administrative 
boundaries  alone,  a  Map  Editor  is  required  in  order  to  integrate  and/or  modify  this 
information.  The  Map  Descriptor  is  the  application  enabler  responsible  for  the 
automated  generation  of  first-order  logic  descriptions  of  geographic  objects.  The 
Learning  Server  provides  a  suite  of  learning  systems  that  can  be  run  by  multiple  users 
to  train  INGENS.  Currently,  two  inductive  learning  systems  are  available  in  the  suite: 
INDUBI/CSL [6] and ATRE [7]. Both systems can induce first-order logic
descriptions  of  some  concepts  from  a  set  of  training  examples.  Nevertheless,  they 
adopt  different  generalization  models  and  search  strategies,  so  that  they  can  induce 
different  descriptions  from  the  same  set  of  examples.  The  system  ATRE  is  described 
in  the  next  section.  The  Query  Interpreter  is  the  inference  engine  that  allows  any  user 
to  formulate  a  query  in  a  first-order  logic  language.  The  query  can  contain  both  spatial 
and  aspatial  descriptors  that  can  be  automatically  generated  by  the  Map  Descriptor,  as 
well  as  new  descriptors  whose  operational  description  has  already  been  learned. 

The  interface  layer  implements  a  Graphical  User  Interface  (GUI),  which  allows 
the four categories of INGENS users to create/maintain/delete a repository of maps,
train  the  system  to  learn  operational  definitions  of  some  geographic  concepts,  choose 
a  specific  map  repository  and  query/browse  it  on  the  basis  of  the  content  of  its  maps. 


3  Learning  classification  rules  for  geographical  objects 

Sophisticated  end  users  may  train  INGENS  in  order  to  learn  operational  definitions  of 
geographical  objects  that  are  not  explicitly  modeled  in  the  database.  The  system 
ATRE,  which  is  presented  in  this  section,  can  induce  recursive  logical  theories  and 
can  autonomously  discover  concept  dependencies,  the  latter  being  an  important  issue 
for  many  map  interpretation  problems. 

Here the term logical theory (or simply theory) denotes a set of first-order definite
clauses. An example of a logical theory is the following:

downtown(X) ← high_business_activity(X), onthesea(X).
residential(X) ← close_to(X,Y), downtown(Y), low_business_activity(X).
residential(X) ← close_to(X,Y), residential(Y), low_business_activity(X).

It  expresses  sufficient  conditions  for  the  two  concepts  of  “main  business  center  of  a 
city”  and  “residential  zone,”  which  are  represented  by  the  unary  predicates  downtown 
and  residential,  respectively. 
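
To make the reading of such a theory concrete, the following minimal Python sketch
evaluates the three clauses above by naive forward chaining over a few ground facts.
The zone names (z1, z2, z3) and the facts are hypothetical, chosen only for
illustration, and the loop is a plain fixpoint computation; it is not meant to
reflect how INGENS or ATRE evaluates theories.

```python
# Hypothetical ground facts about three zones (illustrative only).
high_business = {"z1"}
low_business = {"z2", "z3"}
on_the_sea = {"z1"}
close_to = {("z2", "z1"), ("z3", "z2")}


def derive(high_business, low_business, on_the_sea, close_to):
    """Apply the three clauses until no new fact can be derived."""
    # downtown(X) <- high_business_activity(X), onthesea(X).
    downtown = high_business & on_the_sea
    residential = set()
    changed = True
    while changed:
        changed = False
        for x, y in close_to:
            if x in low_business and x not in residential:
                # residential(X) <- close_to(X,Y), downtown(Y), low_business_activity(X).
                # residential(X) <- close_to(X,Y), residential(Y), low_business_activity(X).
                if y in downtown or y in residential:
                    residential.add(x)
                    changed = True
    return downtown, residential


downtown, residential = derive(high_business, low_business, on_the_sea, close_to)
print(sorted(downtown))
print(sorted(residential))
```

In this toy data z2 becomes residential through the non-recursive clause (it is close
to the downtown zone z1), while z3 is reached only through the recursive clause, since
it is close to z2 but not to any downtown zone.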

The  learning  problem  solved  by  ATRE  can  be  formulated  as  follows: 

Given 

•  a set of concepts C_1, C_2, ..., C_r to be learned,

•  a set of observations O described in a language L_O,

•  a background knowledge BK described in a language L_BK,

•  a language of hypotheses L_H,

•  a generalization model Γ over the space of hypotheses,

•  a user’s preference criterion PC,

Find

a (possibly recursive) logical theory T for the concepts C_1, C_2, ..., C_r, such that T is
complete and consistent with respect to O and satisfies the preference criterion PC.

The completeness property holds when the theory T explains all observations in O
of the r concepts C_i, while the consistency property holds when the theory T explains
no counter-example in O of any concept C_i. The satisfaction of these properties
guarantees the correctness of the induced theory with respect to O.
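
The two properties can be stated as simple set conditions. The sketch below is an
illustration only: it assumes that the set of examples explained by a theory has
already been computed by some other means, and the function names and the sample
data are hypothetical.

```python
def is_complete(explained, positives):
    """Complete: every positive example is explained by the theory."""
    return positives <= explained


def is_consistent(explained, negatives):
    """Consistent: no negative example (counter-example) is explained."""
    return not (explained & negatives)


# Hypothetical example sets for a single concept.
positives = {"zone1", "zone7"}
negatives = {"zone2", "zone3"}

explained_by_good_theory = {"zone1", "zone7"}
explained_by_overgeneral_theory = {"zone1", "zone3", "zone7"}

print(is_complete(explained_by_good_theory, positives),
      is_consistent(explained_by_good_theory, negatives))
print(is_complete(explained_by_overgeneral_theory, positives),
      is_consistent(explained_by_overgeneral_theory, negatives))
```

The second theory is complete but not consistent, because it also explains the
counter-example zone3.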

As to the representation languages L_O, L_BK, L_H, the basic component is the literal,
which takes two distinct forms:

f(t_1, ..., t_n) = Value (simple literal)      g(s_1, ..., s_n) ∈ [a..b] (set literal),

where f and g are function symbols called descriptors, the t_i's and s_i's are terms,
and [a..b] is a closed interval. Descriptors can be either nominal or linear, according
to the ordering relation defined on their domain values. Some examples of literals are:
color(X)=blue, distance(X,Y)=63.9, width(X)∈[82.2..83.1], and close_to(X,Y)=true.

The last example shows the lack of predicate symbols in the representation
languages adopted by ATRE. Indeed, ATRE can deal with classical negation, but
not with negation by failure, not [5]. Thus, the first-order literals p(X,Y) and ¬p(X,Y)
are represented as f_p(X,Y)=true and f_p(X,Y)=false, respectively, where f_p is the
function symbol associated with the predicate p. Henceforth, for the sake of simplicity,
we will adopt the usual notation p(X,Y) and ¬p(X,Y).
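
As an illustration of the two literal forms and of the encoding of negation, the
following Python sketch models simple and set literals as small data classes. The
class and function names (SimpleLiteral, SetLiteral, encode_predicate) are
hypothetical and are not taken from ATRE; the sketch only mirrors the definitions
given above.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class SimpleLiteral:
    """f(t_1, ..., t_n) = Value"""
    descriptor: str
    args: Tuple[str, ...]
    value: Union[str, float, bool]


@dataclass(frozen=True)
class SetLiteral:
    """g(s_1, ..., s_n) in [a..b], with [a..b] a closed interval."""
    descriptor: str
    args: Tuple[str, ...]
    low: float
    high: float

    def covers(self, observed: float) -> bool:
        # Closed interval: both end points are included.
        return self.low <= observed <= self.high


def encode_predicate(name, args, negated=False):
    """Encode p(...) as f_p(...)=true and its classical negation as f_p(...)=false."""
    return SimpleLiteral(name, tuple(args), not negated)


color = SimpleLiteral("color", ("X",), "blue")
width = SetLiteral("width", ("X",), 82.2, 83.1)
not_close = encode_predicate("close_to", ["X", "Y"], negated=True)

print(color)
print(width.covers(82.5), width.covers(90.0))
print(not_close.value)
```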

The language of observations L_O is object-centered, meaning that observations are
represented as ground multiple-head clauses, called objects, with a conjunction of
simple literals in the head. An instance of an object is the following:

downtown(zone1) ∧ residential(zone2) ← close_to(zone1, zone2), onthesea(zone1),
    high_business_activity(zone1), low_business_activity(zone2).

which is semantically equivalent to the set of definite clauses:

downtown(zone1) ← close_to(zone1, zone2), onthesea(zone1),
    high_business_activity(zone1), low_business_activity(zone2).
residential(zone2) ← close_to(zone1, zone2), onthesea(zone1),
    high_business_activity(zone1), low_business_activity(zone2).

Multiple-head clauses are peculiar to ATRE and present two main advantages with
respect to definite clauses: higher comprehensibility and efficiency. The former is
basically due to the fact that multiple-head clauses provide us with a compact
description of multiple properties to be predicted in a complex object like those we
may have in map interpretation. The second advantage is the possibility to have a
unique representation of known properties shared by a subset of observations. In fact,
ATRE distinguishes objects from examples, which are described as pairs ⟨H, OID⟩
where H is a literal in the head of the object indicated by the object identifier OID.
Examples can be considered as positive or negative, according to the concept to be
learned. For instance, ⟨downtown(zone1)=true, O_1⟩ is a positive example of the concept
downtown(X)=true, a negative example of the concept downtown(X)=false, and it is
neither a positive nor a negative example of the concept residential(X)=true.

The language of hypotheses L_H is that of linked, range-restricted definite clauses
[2] with simple and set literals in the body and one simple literal in the head. The
interval [a..b] in a set literal f(X_1, ..., X_n) ∈ [a..b] is computed according to the same
criterion used in INDUBI/CSL [6]. Some examples of clauses induced by ATRE are
given in the next section.

The background knowledge defines any relevant domain knowledge. It is
expressed in a language L_BK with the same constraints as the language of hypotheses.
The following is an example of spatial background knowledge:

close_to(X,Y) ← adjacent(X,Y)

which states that two adjacent zones are also close.

Theories generated by ATRE can be easily translated into sets of Datalog definite
clauses with built-in predicates [1], which allows notions and properties of
standard first-order logic (e.g., resolution) to be extended to ATRE definite clauses.

Regardless  of  the  representation  language  adopted,  a  key  part  of  the  induction 
process  is  the  search  through  a  space  of  hypotheses.  A  generalization  model  provides 
a  basis  for  organizing  this  search  space,  since  it  establishes  when  a  hypothesis 
explains  a  positive/negative  example  and  when  a  hypothesis  is  more  general/specific 
than  another.  A  novel  generalization  model,  named  generalized  implication  [7],  is 
adopted  by  ATRE. 

The  main  learning  procedure  is  shown  in  Figure  2.  To  illustrate  the  algorithm,  let 
us  consider  the  following  input  data: 


Objects   O_1:  downtown(zone1) ∧ ¬residential(zone1) ∧ residential(zone2) ∧
                ¬downtown(zone2) ∧ ¬downtown(zone3) ∧ residential(zone4) ∧
                ¬downtown(zone4) ∧ ¬downtown(zone5) ∧ ¬residential(zone5) ∧
                ¬residential(zone6) ∧ downtown(zone7) ∧ ¬residential(zone7) ←
                onthesea(zone1), high_business_activity(zone1), close_to(zone1, zone2),
                low_business_activity(zone2), close_to(zone2, zone4), adjacent(zone1, zone3),
                onthesea(zone3), low_business_activity(zone3), low_business_activity(zone4),
                close_to(zone4, zone5), high_business_activity(zone5), adjacent(zone5, zone6),
                low_business_activity(zone6), close_to(zone6, zone8), low_business_activity(zone8),
                close_to(zone7, zone3), onthesea(zone7), high_business_activity(zone7).

BK        close_to(X,Y) ← adjacent(X,Y)
          close_to(X,Y) ← close_to(Y,X)

Concepts  C_1:  downtown(X)=true
          C_2:  residential(X)=true

PC        Minimize/maximize negative/positive examples explained by the theory.

procedure learn_recursive_theory(Objects, BK, {C_1, ..., C_r}, PC)
    SatObjects := saturate_objects(Objects, BK)
    Examples := generate_pos_and_neg_examples(Objects, {C_1, ..., C_r})
    LearnedTheory := ∅
    Concepts := {C_1, ..., C_r}
    repeat
        ConsistentClauses := parallel_conquer(Concepts, Examples, PC)
        Clause := find_best_clause(ConsistentClauses, PC)
        ConsistentTheory := verify_global_consistence(Clause, LearnedTheory, Objects, Examples)
        LearnedTheory := ConsistentTheory ∪ {Clause}
        Objects := saturate_objects(SatObjects, LearnedTheory)
        Examples := update_examples(LearnedTheory, Examples)
        foreach C_i in Concepts do
            if pos_examples(C_i) = ∅ then Concepts := Concepts \ {C_i}
            endif
        endforeach
    until Concepts = ∅
    return LearnedTheory


Fig.  2.  ATRE:  Main  procedure. 

The first step towards the generation of inductive hypotheses is the saturation of
all objects with respect to the given BK [9]. In this way, information that was implicit
in the object, given the background knowledge, is made explicit. In the above
example, the saturation of O_1 involves the addition of the nine literals logically
entailed by BK, that is: close_to(zone2, zone1), close_to(zone1, zone3), close_to(zone3,
zone1), close_to(zone3, zone7), close_to(zone4, zone2), close_to(zone5, zone4),
close_to(zone5, zone6), close_to(zone6, zone5), and close_to(zone8, zone6).
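
The saturation step can be sketched in a few lines of Python for the two rules of
this particular BK (adjacent zones are close, and closeness is symmetric). The
function name saturate and the two starting facts are assumptions made for
illustration; this is not a general saturation procedure and not ATRE's own code.

```python
def saturate(close_to, adjacent):
    """Close the close_to relation under the two BK rules:
         close_to(X,Y) <- adjacent(X,Y)
         close_to(X,Y) <- close_to(Y,X)
    """
    # Rule 1: every adjacent pair is also a close_to pair.
    saturated = set(close_to) | set(adjacent)
    # Rule 2: add the mirror image of every pair. One pass is enough here,
    # because mirroring an already symmetric set adds nothing new.
    saturated |= {(y, x) for (x, y) in saturated}
    return saturated


# A small hypothetical fragment of an object body.
body_close_to = {("zone1", "zone2")}
body_adjacent = {("zone1", "zone3")}

added = saturate(body_close_to, body_adjacent) - body_close_to
print(sorted(added))
```

For this fragment the literals made explicit are close_to(zone2, zone1),
close_to(zone1, zone3) and close_to(zone3, zone1).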

Initially, all positive and negative examples (pairs ⟨L, OID⟩) are generated for every
concept to be learned, the learned theory is empty, while the set of concepts to be
learned contains all C_i. With reference to the above input data, the system generates
two positive examples for C_1 (downtown(zone1) and downtown(zone7)), two positive
examples for C_2 (residential(zone2) and residential(zone4)), and eight negative
examples equally distributed between C_1 and C_2 (¬downtown(zone2),
¬downtown(zone3), ¬downtown(zone4), ¬downtown(zone5), ¬residential(zone1),
¬residential(zone5), ¬residential(zone6), ¬residential(zone7)).
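
The generation of examples from the head of an object can be sketched as follows.
This is a simplified illustration under stated assumptions: head literals are
written as (descriptor, argument, value) triples, a concept as a
(descriptor, value) pair, and the function name generate_examples and the short
head used below are hypothetical.

```python
def generate_examples(head, object_id, concepts):
    """Split the head literals of one object into positive and negative
    examples of each concept. An example is a pair (literal, object_id)."""
    positives = {concept: [] for concept in concepts}
    negatives = {concept: [] for concept in concepts}
    for descriptor, argument, value in head:
        for concept in concepts:
            concept_descriptor, concept_value = concept
            if descriptor != concept_descriptor:
                # A literal on another descriptor is neither a positive
                # nor a negative example of this concept.
                continue
            example = ((descriptor, argument, value), object_id)
            if value == concept_value:
                positives[concept].append(example)
            else:
                negatives[concept].append(example)
    return positives, negatives


# Hypothetical fragment of an object head.
head = [
    ("downtown", "zone1", True),
    ("residential", "zone1", False),
    ("residential", "zone2", True),
    ("downtown", "zone2", False),
]
concepts = [("downtown", True), ("residential", True)]

positives, negatives = generate_examples(head, "O1", concepts)
print(positives[("downtown", True)])
print(negatives[("downtown", True)])
```

With this fragment, downtown(zone1)=true yields a positive example of
downtown(X)=true and downtown(zone2)=false a negative one, while neither literal
contributes to the concept residential(X)=true.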

The  procedure  parallel_conquer  generates  a  set  of  consistent  clauses,  whose 
minimum  number  is  defined  by  the  user.  For  instance,  by  requiring  the  generation  of 
at  least  one  consistent  clause  with  respect  to  the  above  examples,  this  procedure 
returns  the  following  set  of  clauses: 

downtown(X) ← onthesea(X), high_business_activity(X).
downtown(X) ← onthesea(X), adjacent(X,Y).
downtown(X) ← adjacent(X,Y), onthesea(Y).

The first of these is selected according to the preference criterion (procedure
find_best_clause). In fact, the hypothesis space of the concept residential has been
explored simultaneously, but by the time the three consistent clauses for the concept
downtown have been found, no consistent clause for residential has been discovered
yet. Thus, the parallel_conquer procedure stops, since the number of consistent
clauses is greater than one.

Since the addition of a consistent clause to the partially learned theory may lead to
an augmented, inconsistent theory, the procedure verify_global_consistence performs
the necessary checks and possibly reformulates the theory in order to recover the
consistency property without repeating the learning process from scratch. The
reformulation is based on the layering technique, which is peculiar to ATRE. The
learned clause is used to resaturate the object. Continuing the previous example, the
two literals added to O_1 are downtown(zone1) and downtown(zone7). This operation
enables ATRE to generate also definitions of the concept residential that depend on
the concept downtown. Indeed, at the second iteration the procedure parallel_conquer
returns the clause:

residential(X) ← close_to(X,Y), downtown(Y), low_business_activity(X).

and by resaturating the object with both learned clauses, it becomes possible to
generate a recursive clause at the third iteration, namely:

residential(X) ← close_to(X,Y), residential(Y), low_business_activity(X).

At  the  end  of  each  iteration,  the  procedure  update_examples  tags  positive  examples 
explained  by  the  current  learned  theory,  so  that  they  are  no  longer  considered  for  the 
generation  of  new  clauses.  The  loop  terminates  when  all  positive  examples  are 
tagged,  meaning  that  the  learned  theory  is  complete  and  consistent. 


4  Application to Apulian map interpretation

INGENS has been applied to the recognition of four morphological elements in
topographic maps of the Apulia region (Italy), namely regular grid system of farms,
fluvial landscape, system of cliffs and royal cattle track. Such elements are deemed
relevant for environmental protection, and are of interest to town planners. A
regular grid system of farms is a particular model of rural space organization that
originated from the process of rural transformation. The fluvial landscape is
characterized by the presence of waterways, fluvial islands and embankments. The
system of cliffs presents a number of terrace slopes with the emergence of blocks of
limestone. A royal cattle track is a road for transhumance that can be found
exclusively in the South-Eastern part of Italy.

The considered territory covers 131 km² in the surroundings of the Ofanto River,
spanning from the zone of Canosa to the mouth of the Ofanto. The examined area is
covered by five map sheets on a scale of 1:25000 produced by the IGMI. The
gridding step chosen for the segmentation of the territory delimits, for each map,
square observation units of 1 km² each. This is the same gridding system
superimposed over the IGMI topographic charts on a scale of 1:25000. Thus there is a
one-to-one mapping between observation units in the chart and single cells in the
database. Each cell has to be described in the logic formalism of ATRE objects.

First-order  logic  descriptions  of  the  maps  are  generated  by  applying  algorithms 
derived  from  geometrical,  topological,  and  topographical  reasoning.  Since  descriptors 
are  quite  general  they  can  also  be  used  to  describe  maps  on  different  scales.  A  partial 
description  of  a  cell  containing  fifty-two  distinct  objects  is  given  in  Figure  3.  The 
whole  description  is  a  clause  with  three  hundred  and  forty  literals  in  the  body. 

class(x1)=other ←
    contain(x1,x2)=true, contain(x1,x3)=true, ..., contain(x1,x53)=true,
    type_of(x2)=canal_line, type_of(x3)=vegetation, ..., type_of(x53)=vegetation,
    color(x2)=blue, color(x3)=black, ..., color(x53)=black, trend(x2)=straight,
    trend(x8)=straight, trend(x51)=curvilinear, extension(x2)=184.057,
    extension(x8)=170.074, ..., extension(x51)=982.207,
    geographic_direction(x2)=north_west, geographic_direction(x8)=north_est, ...,
    geographic_direction(x49)=north_est, shape(x16)=cuspidal, shape(x18)=cuspidal,
    shape(x44)=cuspidal, density(x4)=low, density(x6)=low, density(x52)=low,
    relation(x8,x10)=almost_parallel, relation(x8,x14)=almost_parallel,
    relation(x46,x49)=almost_parallel, distance(x8,x10)=463.09,
    distance(x8,x14)=423.111, distance(x46,x49)=477.322


Fig. 3. Partial logical description of a cell. Constant x1 represents the whole cell, while all
other  constants  denote  the  fifty-two  enclosed  geographic  objects.  Distances  and  extensions 
are  expressed  in  meters. 

Then  the  problem  of  recognizing  the  four  morphological  elements  can  be 
reformulated  as  the  problem  of  labeling  each  cell  with  at  most  one  of  four  labels. 
Unlabelled  cells  are  considered  uninteresting  with  respect  to  the  goal  of 
environmental  protection.  Globally  131  cells  were  selected,  each  of  which  was 
assigned  to  one  of  the  following  five  classes:  system  of  farms,  fluvial  landscape, 
system  of  cliffs,  royal  cattle  track  and  other.  The  last  class  simply  represents  “the  rest 
of  the  world,”  and  no  classification  rule  is  generated  for  it.  Indeed,  the  cells  assigned 
to  it  are  not  interesting  with  respect  to  the  problem  of  environmental  protection  under 
study,  and  they  are  always  used  as  negative  examples  when  ATRE  learns 
classification  rules  for  the  remaining  classes.  Forty-five  cells  from  the  map  of  Canosa 
were  selected  to  train  the  system,  while  the  remaining  eighty-six  cells  were  randomly 
selected  from  the  other  four  maps.  The  preference  criterion  PC  maximizes  both  the 
number  of  explained  positive  examples  and  the  number  of  clause  literals. 

A  fragment  of  the  logical  theory  induced  by  ATRE  is  reported  in  the  following: 

class(X1)=fluvial_landscape ← contain(X1,X2), color(X2)=blue,
    type_of(X2)=river, trend(X2)=curvilinear, extension(X2)∈[325.00..818.00].
class(X1)=fluvial_landscape ← contain(X1,X2), type_of(X2)=river, color(X2)=blue,
    relation(X3,X2)=almost_perpendicular, extension(X2)∈[615.16..712.37],
    trend(X3)=straight.
class(X1)=system_of_farms ← contain(X1,X2), color(X2)=black,
    relation(X2,X3)=almost_perpendicular, relation(X3,X4)=almost_parallel,
    type_of(X4)=interfarm_road, geographic_direction(X4)=north_est,
    extension(X2)∈[362.34..712.25], color(X3)=black,
    type_of(X3)=farm_road, color(X4)=black, trend(X2)=straight.

The first two clauses explain all training observations of a fluvial landscape, while
the third clause is a partial description of the concept system_of_farms.
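To illustrate how such a clause can be applied, the first fluvial landscape clause can be read as a test over the objects contained in a cell. The following is only a minimal sketch: the dictionary encoding of map objects and the function name `is_fluvial_landscape` are assumptions made for this example, not INGENS or ATRE data structures.

```python
def is_fluvial_landscape(cell_objects):
    """Check the first induced clause against the objects contained in one cell.

    cell_objects is a hypothetical list of dicts, one per object X2 with
    contain(X1, X2), carrying the attributes named in the clause.
    """
    for obj in cell_objects:
        if (
            obj.get("color") == "blue"
            and obj.get("type_of") == "river"
            and obj.get("trend") == "curvilinear"
            # A missing extension becomes NaN, which fails both comparisons.
            and 325.00 <= obj.get("extension", float("nan")) <= 818.00
        ):
            return True
    return False


river = {"color": "blue", "type_of": "river",
         "trend": "curvilinear", "extension": 500.0}
print(is_fluvial_landscape([river]))
```

A cell matches as soon as one contained object satisfies every literal of the clause body, which mirrors the existential reading of X2.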

In  order  to  test  the  accuracy  of  the  induced  theory,  the  Query  Interpreter  was 
provided  with  both  the  eighty-six  observations,  reserved  for  the  test  phase,  and  the 
theory  itself.  Test  cells  were  recognized  with  a  predictive  accuracy  of  over  95%. 
These  results  are  promising,  although  they  are  affected  by  the  careful  selection  of 
both  a  suitable  representation  of  observations  and  a  training  set.  Results  of  a  previous 
experiment on a smaller scale map of the same region (1:50000) are reported in [3].

5. Conclusions and future work

Knowledge  of  the  meaning  of  symbols  listed  in  map  legends  is  not  generally 
sufficient  to  recognize  interesting  geographic  patterns  on  a  topographic  map,  so  that 
GIS  users  are  asked  to  formulate  quite  complex  queries  to  describe  such  patterns.  In 
fact,  these  user  queries  are  operational  definitions  of  abstract  concepts  often  reported 
in specialist texts and handbooks. To support GIS users in this activity, a new
approach has been proposed in this paper. The idea is to ask users for a set of classified
instances of the patterns of interest, and then to apply machine learning tools
and techniques to generate the operational definitions of such patterns. These
definitions  can  be  subsequently  used  to  search  for  new  instances  not  in  the  training 
set,  or  to  facilitate  the  formulation  of  a  query.  INGENS  is  a  prototypical  GIS  with 
learning  capabilities  that  has  been  designed  and  implemented  to  provide  users  with  a 
training  facility.  An  application  of  the  system  to  the  problem  of  Apulian  map 
interpretation  has  been  briefly  described,  and  preliminary  experimental  results  are 
presented.  The  learning  system  used  in  this  application  is  ATRE,  whose  innovative 
features  have  been  briefly  explained. 

INGENS  can  be  extended  in  various  directions.  Currently,  a  set  of  generalization 
and  abstraction  operators  has  been  implemented  to  provide  the  user  with  some  tools 
that  simplify  the  complex  descriptions  produced  by  the  Map  Descriptor.  These 
operators  are  similar  to  those  commonly  used  in  on-line  analytical  processing  (OLAP) 
tools. In the future, we plan to embed a system for the discovery of spatial
association rules in the Learning Server.

Abstract. Data mining, often called knowledge discovery in databases (KDD),
aims  at  semiautomatic  tools  for  the  analysis  of  large  data  sets.  This  report  is  first 
intended  to  serve  as  a  timely  overview  of  a  rapidly  emerging  area  of  research, 
called  temporal  data  mining  (that  is,  data  mining  from  temporal  databases  and/or 
discrete time series). In particular, we provide a general overview of temporal
data mining, motivating the importance of problems in this area, which include
formulations of the basic categories of temporal data mining methods, models,
techniques and some other related areas. This report also outlines a general
framework for analysing discrete time series databases, based on hidden periodicity
analysis, and presents the preliminary results of our experiments on the exchange
rate data between the US dollar and the Canadian dollar.

Keywords:  data  mining,  temporal  databases,  temporal  data  analysis,  time  series, 
statistical  theory,  hidden  periodicity  analysis. 


1  Introduction 

Data mining, also known as Knowledge Discovery in Databases (KDD), aims at
semiautomatic tools for the analysis of large, realistic data sets. It is a rapidly evolving area of
research at the intersection of several disciplines, including statistics, pattern
recognition, databases, optimization, visualization and high-performance computing. There
are some very important and challenging research problems in data mining, for instance,
the application of data mining techniques and tools to different types of databases such
as temporal databases, spatial databases and so on. This paper focuses on issues and
challenges for mining temporal databases.

Temporal data mining is concerned with discovering qualitative and quantitative
temporal patterns in a temporal database or in a discrete-valued time series (DTS) dataset.
Recently, two major kinds of problems have received special attention in the literature:
the similarity problem and the periodicity problem. Although there are various results to date
on discovering periodic patterns and similarity patterns in discrete-valued time series
(DTS) datasets (e.g. [2]), a general theory of discovering patterns for DTS data analysis
is not yet well established.

In this article, we first provide an overview of temporal data mining (TDM), define
some of the key ideas, identify a variety of challenging problems, both in the theory and
the systems, and motivate their importance. We then propose a general framework for
analysing a DTS and focus on the special problem of discovering patterns using
hidden periodicity analysis.

The rest of the paper is organized as follows. Section 2 begins with a discussion of temporal
databases (in particular time-series databases) and then moves on to current issues in
temporal data mining. Section 3 provides a few definitions for temporal patterns, and the
theory and methods of hidden periodicity analysis. Section 4 presents an experimental
analysis of pattern discovery on US dollar versus Canadian dollar exchange rates. The
paper concludes with a brief summary.


2  An  Overview  of  Temporal  Data  Mining 

2.1  What  Is  a  Temporal  Database? 

Time is an important aspect of real-world phenomena. Conventional databases model a
dynamically changing enterprise by a snapshot at a particular point in time. Traditional
databases store only the current state of the data, so when new data
become valid, old data are overwritten (or lost). In many situations, this kind of
database is inadequate. It cannot easily handle historical queries, because it is
not designed to model the way in which the entities represented in the database change
over time.

Due to the importance of time-varying data, efforts have been made to design Temporal
Databases (TDB) which support some aspect of time such as valid time, logical
time and transaction time [3]. TDBs overcome this limitation of traditional
databases by not overwriting attribute data, but instead storing valid-time
ranges with them, which can be used to determine their validity at particular times,
including the present. Numerous time concepts have been proposed to date for storing
information in temporal databases, such as valid time, denoting the time at which a fact was true
in reality, and transaction time, representing the time at which the information was entered into
the database. In addition to these two concepts, which are of general interest, there are
also user-defined time (time fields in a traditional database), decision time, absolute time
and relative time.
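The difference between overwriting a value and storing it with a valid-time range can be sketched as follows. This is a minimal illustration, not a temporal database implementation: the class names `Version` and `ValidTimeAttribute`, the integer time stamps, and the assumption that updates arrive in increasing time order are all choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    value: str
    valid_from: int           # inclusive start of the valid-time range
    valid_to: Optional[int]   # exclusive end; None means "still valid"


class ValidTimeAttribute:
    """Hypothetical attribute that keeps its history instead of overwriting."""

    def __init__(self) -> None:
        self.versions: List[Version] = []

    def set(self, value: str, at: int) -> None:
        # Close the currently valid range rather than discarding the old value.
        if self.versions and self.versions[-1].valid_to is None:
            self.versions[-1].valid_to = at
        self.versions.append(Version(value, at, None))

    def value_at(self, t: int) -> Optional[str]:
        # Historical query: which value was valid at time t?
        for v in self.versions:
            if v.valid_from <= t and (v.valid_to is None or t < v.valid_to):
                return v.value
        return None


address = ValidTimeAttribute()
address.set("Sydney", 1990)
address.set("Charlotte", 2000)
print(address.value_at(1995), address.value_at(2005))
```

A snapshot database would keep only the last value, so the query for 1995 could not be answered.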

These kinds of time induce different types of databases. A traditional database supporting
neither valid nor transaction time is termed a snapshot database, since it contains
only a snapshot of the real world. A valid-time database contains the entire history of the
enterprise, as best known now. A transaction-time database supports transaction time
and hence allows rolling back the database to a previous state. We adopt the definition
of a Temporal Database provided by Tansel et al. [4] as follows:

Definition 1 A Temporal Database (TDB) is a real-world database that maintains past,
present, and future data.

A TDB model may support one or more of the time concepts. There has been a
great deal of interest in temporal databases over the last decade, with the number of
papers published in the area rising steadily. Numerous models for temporal
databases have been designed using both object-oriented and relational databases
as the underlying database models. The basic understanding of temporal databases has
progressed to the point where a standard temporal language and infrastructure have been
proposed [5]. Most of the work done to date has identified the basic properties of temporal
information, and various data models and associated algebras and query languages have
been produced in order to manipulate data with a temporal component.

Although research in temporal databases is now quite mature, the development of
a general-purpose temporal data mining system still remains in its infancy. According
to their temporal characteristics, objects in temporal databases can be classified into three
categories [6]: (1) time-invariant objects, (2) time-varying objects, and (3) time-series
objects.

In the rest of the section, we focus only on temporal databases for time-series objects,
which are often called time-related databases. Time-related databases are of growing
importance in many modern database applications, such as data mining, data warehousing
and so on.


2.2  Temporal  Data  Mining 

Temporal data mining performs time series analysis on the information held in a
temporal database. Statistical methods provide a natural way of analysing time-related
information in a temporal database.

Definition  2  Temporal  Data  Mining  deals  with  problems  of  knowledge  discovery  from 
large  Temporal  Databases. 

A relevant and important question is how to apply data mining techniques to a
temporal database and how to interpret the results. For instance, sequential/temporal
patterns are mined to analyse a collection (subset) of records over periods of different
variables and/or time, treated as whole records (sets) of variables and/or time. A few
sequential/temporal techniques have been developed, based on the Discrete Fourier Transform, to map a
time sequence to the frequency domain. Other techniques used in the discovery of
sequential/temporal patterns include dynamic time warping, neural networks and rough sets.

Drawing on data mining techniques and the theory of statistical time series analysis,
the theory of temporal data mining may involve the following areas of investigation:

1.  Temporal  data  mining  tasks  include: 

-  Temporal  data  characterization  and  comparison, 

-  Temporal  clustering  analysis, 

-  Temporal  classification, 

-  Temporal  association  rules, 

-  Temporal  pattern  analysis, 

-  Temporal  prediction  and  trend  analysis. 

2.  A  new  temporal  data  model  may  need  to  be  developed  based  on: 

-  Temporal  data  structures, 

-  Temporal semantics, or

3. A new temporal data mining concept may need to be developed based on the following:

-  the  task  of  temporal  data  mining  can  be  seen  as  a  problem  of  extracting  an 
interesting  part  of  the  logical  theory  of  a  model,  and 

-  the  theory  of  a  model  may  be  formulated  in  a  logical  formalism  able  to  express 
quantitative  knowledge  and  approximate  truth. 

Two kinds of problems have been studied in the temporal data mining area
in recent years: (1) the similarity problem, that is, finding a time sequence (or TDB) or
subsequence similar to a given sequence (or query), or finding all pairs of similar
subsequences in the time sequence; and (2) the periodicity problem, that is, finding
periodic patterns in a TDB.

Similarity Problems. In data mining applications, it is often necessary to search within
a time series database (e.g., a TDB) for those series that are similar to a given query series.
This kind of question is part of the general problem often called the similarity search
problem. The similarity search problem for time-series TDBs is to find out how
many series are similar to one another (or to a given series) within the
same time-series set or between different sets, which may be one-dimensional or
multi-dimensional.

There are two main categories of similarity problems for time-series TDBs:

-  All-Occurrences Sub-Sequence Matching (AOSM): given a query series Q of length
n and a TDB of length N (N > n), find all occurrences of a contiguous subsequence
within the TDB that matches Q approximately. The matching assumes that
the query series Q is small, and we look for the subsets of the TDB that
best match the query Q.

-  All-Occurrences Whole-Sequence Matching (AOWSM): given a query series Q of
length n and a set of N data sequences (TDBs) of the same length
n, find all of the TDBs that match Q approximately. The matching
assumes that the sequences to be compared have the same length, and we
look for the TDBs that match the query sequence Q.
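The two categories can be sketched with a brute-force scan. The sketch below assumes that "matches approximately" means a Euclidean distance of at most a tolerance `eps`; the distance measure, the tolerance and the function names are assumptions for illustration, since the text does not fix a particular similarity definition.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def all_subsequence_matches(series, query, eps):
    """AOSM sketch: offsets of every contiguous window within eps of the query."""
    n = len(query)
    return [
        i
        for i in range(len(series) - n + 1)
        if euclidean(series[i:i + n], query) <= eps
    ]


def all_whole_matches(sequences, query, eps):
    """AOWSM sketch: indices of the equal-length sequences within eps of the query."""
    return [
        k
        for k, s in enumerate(sequences)
        if len(s) == len(query) and euclidean(s, query) <= eps
    ]


series = [1, 2, 3, 1, 2, 3.1, 9]
print(all_subsequence_matches(series, [1, 2, 3], eps=0.2))
```

The scan costs on the order of N times n distance terms, which is why practical systems usually index a transformed representation instead of comparing raw windows.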

Many data mining techniques, such as classification, regression and
clustering/segmentation, have been applied to similarity problems. The main steps for solving the
similarity problem are as follows:

-  define similarity: this step allows us to find similarities between sequences with
different scaling factors and baseline values.

-  choose a query sequence: this step determines what we want to find in
large sequences (e.g., characteristics, classification).

-  processing algorithms for the TDB: this step allows us to use some statistical methods
(e.g., transformation, wavelet analysis) on a TDB to remove noisy data and interpolate
missing data.

-  processing an approximate algorithm: this step allows us to build up the classification
scheme for a time-series TDB according to the definition of similarity by using some
data mining techniques (e.g., visualisation).

In a search for similarity, a given query series may admit different types of matching,
such as full match, match with shift, match with scaling, match with a combination
of scaling and shifting, approximate match and so on. Many techniques in these
areas make use of statistical analysis theory such as wavelet analysis, multi-Fourier
analysis and various statistical transformations. Some results on similarity problems and
case studies have been published in the literature (e.g., [7]). The results of a similarity
search in a time-series TDB can be used for association, prediction, and so on.
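One common way to obtain a match that tolerates shifting and scaling is to normalize each sequence to zero mean and unit variance before comparing. The sketch below uses this z-normalization as an assumed similarity definition; it is one possible choice among those the text alludes to, and the function names are illustrative.

```python
import math


def z_normalize(seq):
    """Remove the baseline (mean) and the scaling factor (standard deviation)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        # A constant sequence has no shape; map it to all zeros.
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


def normalized_distance(a, b):
    """Euclidean distance between the z-normalized forms of a and b."""
    za, zb = z_normalize(a), z_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


a = [1, 2, 3, 2]
b = [10, 30, 50, 30]   # b = 20 * a - 10: same shape, different scale and baseline
c = [3, 2, 1, 2]       # mirrored shape
print(normalized_distance(a, b), normalized_distance(a, c))
```

Under this definition `a` and `b` are a match with combined scaling and shifting, while `c` is not, even though its raw values are closer to those of `a`.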


Periodical Problems. The periodicity problem involves finding periodic patterns, or
cyclicity, occurring in a time-series TDB. The problem is related to two concepts: pattern and
interval. In any selected sequence of a TDB, we are interested in finding the patterns that
repeat over time and their recurring intervals (periods), or finding the repeating patterns
of a sequence (or TDB) as well as the interval which corresponds to the pattern period.
The basic variations of periodicity problems include value-based, trend-based, partial
pattern and complete pattern problems. There are two main categories of periodicity
problems:

1. Fixed Period Periodicity Search: this kind of periodicity search algorithm is based
on data cubes and OLAP operations combined with some sequential pattern search
strategies to discover large periodic patterns in a time series (or TDB).

2. Arbitrary Periodicity Search: this kind of periodicity search algorithm is based
on mathematical techniques: sequential algorithms, forward optimization algorithms
and backward optimization algorithms.
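For a fixed, known period, the idea of a partial periodic pattern can be illustrated with a simple frequency count per position in the cycle. This sketch deliberately replaces the data cube and OLAP machinery mentioned above with a plain counter; the function name, the symbolic input and the confidence threshold `min_conf` are assumptions made for this example.

```python
from collections import Counter


def fixed_period_patterns(symbols, period, min_conf):
    """For each offset within the period, keep the symbol that recurs in at
    least min_conf of the complete cycles; other offsets stay unconstrained."""
    cycles = len(symbols) // period
    if cycles == 0:
        return {}
    patterns = {}
    for offset in range(period):
        counts = Counter(symbols[c * period + offset] for c in range(cycles))
        symbol, freq = counts.most_common(1)[0]
        if freq / cycles >= min_conf:
            patterns[offset] = symbol
    return patterns


# Offsets 0 and 1 repeat in every cycle of length 3; offset 2 does not.
print(fixed_period_patterns(list("abcabdabcabe"), period=3, min_conf=0.75))
```

The result describes a partial periodic pattern: some positions of the cycle are fixed while others are free.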

A general theory for periodicity search in a time-series TDB is still
lacking, but we can identify some main steps:

-  decide on a definition of the concept of a period, under some assumptions, so
that we know what kind of periodicity search we want to perform on the TDB.

-  build up a set of algorithms that allow us to use properties of periodic time series
to find periodic patterns in a subset of the TDB.

-  apply simulation algorithms to find patterns in the whole TDB.

If a time-series TDB contains unclean data, such as noisy data and missing data,
then the results of a periodicity search will not be useful.

Note: Many techniques for this kind of problem rely on pure
mathematical analysis, such as function data distribution analysis and so on (e.g., [1]).


Discussion. In a time-series TDB, similarity and periodicity search problems
remain difficult: although many methods exist, most of them are
either inapplicable or prohibitively expensive. In fact, similarity and periodicity search
problems can be combined into the problem of finding interesting sequential patterns
in TDBs. Since sequential patterns are essentially associations over temporal data, they
utilize some of the ideas initially proposed for the discovery of association rules (e.g.,
[8,9]). In recent years, some new algorithms have been developed, such as:

-  generalized sequential pattern (GSP) algorithm: it essentially performs a level-wise
or breadth-first search of the sequence lattice spanned by the subsequence relation,

-  sequential pattern discovery using equivalence classes (SPADE) algorithm: it
decomposes the original problem into smaller sub-problems using equivalence classes
on frequent sequences.

2.3  Some  Research  Challenges 

We  list  below  some  of  the  challenges  that  are  of  particular  relevance  to  data  mining. 

-  Develop  a  general  theory  for  a  foundation  of  temporal  data  warehousing,  supporting 
multiple  granularities  and  multiple  lines  of  evolution,  for  data  mining  purposes. 

-  Develop a general asymptotic theory (parameter estimation) for temporal database
models using statistical tools (using multivariate functional analysis).

-  Develop a general modelling theory on different temporal databases for forecasting
of the processes with or without parameters.

-  Develop data reduction methods for removing redundant or irrelevant data.

-  Build up a general technique for mapping a local multivariate time series, using part
of the temporal data, to k-dimensional space such that the dissimilarities (or interesting
properties) are preserved.

-  Develop new temporal data mining methodologies such as statistical tools, neural
nets and ad-hoc query-based mining.

In the rest of the paper, we turn our attention to a particular temporal data mining
method based on a well-known statistical tool, called hidden periodicity analysis [10].

3  Hidden  Periodicity  Analysis 

This  section  first  provides  a  few  definitions  and  results  to  formalize  what  we  mean  by 
periodic  and  similar  patterns,  and  then  discusses  hidden  periodicity  analysis  in  more 
detail. 


3.1  Temporal  Patterns 

Without loss of generality, we consider the bivariate data $(X_1, Y_1), \ldots, (X_n, Y_n)$, which
form an independent and identically distributed sample from a population $(X, Y)$. We regard
the data as being generated from the model

$$Y = m(X) + \sigma(X)\varepsilon$$

where $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = 1$, and $X$ and $\varepsilon$ are independent.

We assume that, for every successive pair of time points in a DTS, $t_{i+1} - t_i = f(t)$
is a function (in most cases, $f(t)$ = constant). For every succession of three time points
$X_j$, $X_{j+1}$ and $X_{j+2}$, the triple $(Y_j, Y_{j+1}, Y_{j+2})$ can take only nine different states
(also called nine local features). Let $S_s$ denote the same state as the prior one, $S_u$ the
go-up state compared with the prior one, and $S_d$ the go-down state compared with the prior one.
Then we have the state-space $S = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9\} = \{(Y_j, S_u, S_u), (Y_j,
S_u, S_s), (Y_j, S_u, S_d), (Y_j, S_s, S_u), (Y_j, S_s, S_s), (Y_j, S_s, S_d), (Y_j, S_d, S_u), (Y_j, S_d, S_s),
(Y_j, S_d, S_d)\}$.

Definition 3 Let $h = \{h_1, h_2, \ldots\}$ be a sequence. If $h_j \in S$ for every $h_j \in h$, then
the sequence $h$ is called a Structural Base sequence and the subsequence $h_{sub}$ of $h$ is
called a sub-Structural Base sequence.
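The nine-state encoding and the resulting structural base sequence can be sketched as follows. The function names and the use of exact equality to detect the "same" state are assumptions made for this example; real exchange-rate data would probably need a tolerance when deciding that two values are equal.

```python
def step_state(prev, curr):
    """Classify one step as go-up (Su), go-down (Sd) or same (Ss)."""
    if curr > prev:
        return "Su"
    if curr < prev:
        return "Sd"
    return "Ss"


# Index of each pair of steps, in the order the state-space S is listed.
STATE_INDEX = {
    ("Su", "Su"): 1, ("Su", "Ss"): 2, ("Su", "Sd"): 3,
    ("Ss", "Su"): 4, ("Ss", "Ss"): 5, ("Ss", "Sd"): 6,
    ("Sd", "Su"): 7, ("Sd", "Ss"): 8, ("Sd", "Sd"): 9,
}


def structural_base(values):
    """Map every triple (Y_j, Y_{j+1}, Y_{j+2}) to one of the states s1..s9."""
    states = []
    for j in range(len(values) - 2):
        first = step_state(values[j], values[j + 1])
        second = step_state(values[j + 1], values[j + 2])
        states.append("s%d" % STATE_INDEX[(first, second)])
    return states


print(structural_base([1, 2, 3, 3, 2, 2]))
```

A series of n values yields n - 2 states, and every element of the output belongs to S, so the output is a structural base sequence in the sense of Definition 3.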

If $h_{sub}$ is a periodic sequence, then $h_{sub}$ may be called a sub-structural periodic
sequence (e.g., a full or partial periodic pattern in B). Also, $h$ is a structural periodic sequence
(periodic pattern(s) exist).

Note that a sequence is called a full periodic sequence if every point in time
contributes (precisely or approximately) to the cyclic behavior of the overall time series
(that is, there are cyclic patterns with the same or different periods of repetition).

A sequence is called a partial periodic sequence if the behavior of the sequence is
periodic at some but not all points in the time series.

Definition 4 Let $y = \{y_1, y_2, \ldots\}$ be a real-valued sequence. If $y_{sub} = \{y_1, y_2, \ldots, y_N\}$
($y_j \in Y$, $j = 1, 2, \ldots, N$) and $N$ is the size of a subset of $y$, then it may be called a
value-point process. If $0 \le y_k < 1$ (mod 1) for all $k$, then we say that $y$ is
uniformly distributed if every subinterval of $[0, 1)$ gets its fair share of the terms of the
sequence in the long run. More precisely, if

$$\lim_{n \to \infty} \frac{\#\{k \le n : y_k \in J\}}{n} = \text{length of } J$$

for all subintervals $J$ of $[0, 1)$.

Definition 5 Let $y = \{y_1, y_2, \ldots\}$ be a sequence of real numbers with $l - \delta < y_k <
l + \delta$ for all $k$. We say that $y$ has an approximate constant sequence distribution. More
generally, if $h(t) - \delta < y_k < h(t) + \delta$ for all $k$, we say that $y$ has an approximate
distribution function $h(t)$.

We have the following results ([11]):

Lemma 1. A discrete-valued dataset contains periodic patterns if and only if there
exist structural periodic patterns and periodic value-point processes, with or without an
independent and identical distribution (i.i.d.).


Lemma 2. In a discrete-valued dataset, there exist similarity patterns if and only if
there exist structural base periodic patterns and a similarity value-point distribution, with
or without an independent and identical distribution.

3.2  Methods  of  Hidden  Periodicity  Analysis 

We briefly introduce the hypothesis testing method of Grenander [10] for detecting
hidden periodicities in noisy data. Suppose that the model of general observations of a
subset is

$$x(t) = \sum_{n=1}^{p} A_n e^{i\lambda_n t} + \eta(t), \quad t \in Z$$

where $p$ is known, $A_n$ and $\lambda_n$ are unknown parameters, $\eta(t)$ is independent and identically
distributed (i.i.d. $N(0, \sigma^2)$) and $\sigma$ is an unknown parameter. Grenander suggested
the following testing procedure:

$$H_0: x(t) = \eta(t), \quad t = 0, \pm 1, \pm 2, \ldots$$

$$H_1: M = r \quad (x(t) \text{ possesses } r \text{ frequency components},\ 1 \le r \le m)$$

In the hypothesis testing, the parameter $r$ is assumed to be known a priori. Since
$r$ is usually unknown, we apply the test step by step: first put $r = 1$ and apply
the test. If $H_0$ is rejected, then put $r = 2$, and so on, until, say, $r = p + 1$, when $H_0$ is
accepted; we then estimate the order as $p$. If $r = 1$ is rejected, then $x(t)$ is not a
white noise series, so the distribution of $P\{g(r) > z\}$ has to be changed.
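The step-by-step procedure can be sketched with a periodogram and Fisher's g statistic, removing the largest peak after each rejection. This is a stand-in chosen for illustration, not necessarily the exact statistic of [10]: the use of Fisher's exact null distribution, the peak-removal step, the significance level `alpha` and all function names are assumptions.

```python
import cmath
import math


def periodogram(x):
    """Periodogram ordinates at the Fourier frequencies 2*pi*k/n, k = 1..(n-1)//2."""
    n = len(x)
    mean = sum(x) / n
    ordinates = []
    for k in range(1, (n - 1) // 2 + 1):
        s = sum((x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates


def fisher_p_value(g, m):
    """P(statistic > g) under white noise, Fisher's exact formula for m ordinates."""
    p = 0.0
    for j in range(1, min(m, int(1 / g)) + 1):
        p += (-1) ** (j - 1) * math.comb(m, j) * (1 - j * g) ** (m - 1)
    return min(1.0, max(0.0, p))


def detect_periodicities(x, alpha=0.01):
    """Test r = 1, 2, ... in turn; stop at the first step where H0 is accepted."""
    remaining = dict(enumerate(periodogram(x), start=1))  # frequency index -> ordinate
    detected = []
    while len(remaining) > 1:
        total = sum(remaining.values())
        if total == 0:
            break
        k_max = max(remaining, key=remaining.get)
        g = remaining[k_max] / total
        if fisher_p_value(g, len(remaining)) >= alpha:
            break  # H0 accepted: the estimated order is len(detected)
        detected.append(k_max)
        del remaining[k_max]
    return detected
```

A frequency index k returned by `detect_periodicities` corresponds to a period of n / k observations, where n is the length of the series.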

4  Discovery  of  Temporal  Patterns 

4.1  Structural  Pattern  Discovery 

In our method, we use squared distance
functions provided by a class of positive semidefinite quadratic forms. Specifically,
if $u = (u_1, u_2, \ldots, u_p)$ denotes the p-dimensional observation of each different
distance of patterns in a state on an object that is to be assigned to one of the g prespecified
groups, then, for measuring the squared distance between $u$ and the centroid of the
ith group, we can consider the function

$$D^2(i) = (u - \bar{y}_i)' M (u - \bar{y}_i)$$

where $\bar{y}_i$ denotes the centroid of the ith group and $M$ is a positive semidefinite matrix,
which ensures that $D^2(i) \ge 0$. Different choices
of the matrix $M$ lead to different metrics, and the class of squared distance functions
represented by the above equation is not unduly narrow.
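The quadratic-form distance and the resulting group assignment can be sketched in a few lines. The function names and the list-of-lists matrix representation are assumptions made for this example; with M equal to the identity matrix the function reduces to the squared Euclidean distance.

```python
def squared_distance(u, centroid, M):
    """Compute (u - centroid)' M (u - centroid) for a square matrix M."""
    d = [a - b for a, b in zip(u, centroid)]
    Md = [sum(M[r][c] * d[c] for c in range(len(d))) for r in range(len(d))]
    return sum(d[r] * Md[r] for r in range(len(d)))


def nearest_group(u, centroids, M):
    """Index of the group whose centroid has the smallest squared distance to u."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(u, centroids[i], M))


identity = [[1, 0], [0, 1]]
weighted = [[2, 0], [0, 0.5]]   # a different metric: stretch axis 1, shrink axis 2
print(squared_distance([1, 2], [0, 0], identity),
      squared_distance([1, 2], [0, 0], weighted))
```

Changing M changes which group is nearest, which is the sense in which different choices of M lead to different metrics.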

4.2  Point- Value  Pattern  Discovery 

Here, we introduce an enhancement of an approach for modelling discrete-valued time
series through hidden periodicity analysis. For value-point pattern discovery, suppose
that the model of observations is that of Subsection 3.2.

The first stage of the method for detecting the characteristics of those records is to use
linear regression analysis. We may assume the linear model is $Y = X\beta + \varepsilon$. The
least-squares estimate (LSE) is $\hat{\beta} = (X^T X)^{-1} X^T Y$. Then we have
$\hat{\beta} \sim N(\beta, \mathrm{Cov}(\hat{\beta}))$. In particular, for $\hat{\beta}_i$ we have $\hat{\beta}_i \sim N(\beta_i, \sigma_i^2)$, where $\sigma_i^2 = \sigma^2 a_{ii}$
and $a_{ii}$ is the ith diagonal element of $(X^T X)^{-1}$.

Now, for each value-point, we may fit a linear model as above and estimate the parameters
by LSE. We first remove the trend effect of each curve from the
original record by subtracting the above regression function at $x$ from the corresponding
value, to obtain a comparatively stationary series. The problem can then be formulated
as the hidden periodicity analysis of a discrete-valued time series.
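For a single series observed at equally spaced times, the trend-removal stage reduces to fitting a straight line by least squares and subtracting it. The sketch below assumes the regressor is the time index t = 0, 1, ..., n - 1 (the text indexes from 1, which only changes the intercept) and uses illustrative function names.

```python
def fit_line(values):
    """Least-squares intercept and slope of values against t = 0..n-1 (n >= 2)."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def detrend(values):
    """Subtract the fitted regression line to obtain the residual series."""
    intercept, slope = fit_line(values)
    return [y - (intercept + slope * t) for t, y in enumerate(values)]


rates = [1.40, 1.42, 1.41, 1.45, 1.44, 1.47]   # made-up values for illustration
print(detrend(rates))
```

The residual series sums to zero by construction and is the input to the hidden periodicity analysis.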

¹ In fact, Grenander considered the model $\xi(t) = \sum_{k=1}^{p} A_k \cos(\omega_k t) + \gamma(t)$, where $p$ is known,
$A_k$ and $\omega_k$ are unknown parameters, $\gamma(t)$ is i.i.d. $N(0, \sigma^2)$ and $\sigma$ is an unknown parameter.

4.3 Experimental Results

For brevity, we only present a few experimental results for both structural and value-point
pattern discovery, on the exchange rate between the US and Canadian dollars.

Structural pattern discovery experiments. We investigate a sample of the
structural base to test the naturalness of the similarity and periodicity of the Structural Base
distribution. We consider 9 states in the state-space of the structural distribution: S = {s1,
s2, s3, s4, s5, s6, s7, s8, s9}. In summary, some results for the structural base experiments
are as follows (e.g., see the right panel of Fig. 1).

-  The structural distribution in a particular transition of states is a hidden periodic
distribution with a periodic length function f(t).

-  There exist some partial periodic patterns in a particular transition of states, based
on a distance shifting function d(t).

-  There also exist some similarity patterns with a small distance shifting in a particular
transition of states.


Fig. 1. Left: 200 business days of the daily U.S. dollar exchange rate against the Canadian dollar in states
base. Right: after removal of the trend effect represented by the linear regression function w(t) for the
1257 business days.


Value-point pattern discovery experiments. Suppose the exchange rate value can be
modelled as

$$Y_i = m(t)\,Y_{i+k} + \varepsilon_i \quad (k > 0 \text{ a fixed integer})$$

We then remove the trend using the linear regression function $Y(t)$, and a new series

$$w(t) = v(t) - Y(t), \quad t = 1, 2, \ldots, N$$

may be obtained, where $v(t)$ is the original record.

We may then apply hidden periodicity analysis to the new
value-based series $w(t)$. Some results of the value-point experiments are given below:

-  there does not exist any full periodic pattern, but there exist some partial periodic
patterns with a distance shifting function $d_j(t)$,

-  there exist similarity patterns, etc.


5  Concluding  Remarks 

This paper has reviewed current research problems and challenges in temporal data
mining. It has also presented a new method based on Hidden Periodicity Analysis for
finding patterns in discrete-valued time series databases. The method described in this
paper is still in its preliminary stages, but it guarantees finding different patterns in the
structural and value-point probability distributions of a real dataset. The method can be
implemented using a straightforward algorithm, and the results of preliminary experiments
are promising.

Abstract. Previous methods for mining association rules require users
to input a minimum support threshold. However, there can be too many
or too few resulting rules if the threshold is set inappropriately. It is
difficult for end-users to find a suitable threshold. In this paper, we
propose a different setting in which the user does not provide a support
threshold, but instead indicates the number of results required.


1  Introduction 

In recent years, there have been many studies on association rule mining. An
example of such a rule is:

∀x ∈ persons, buys(x, "biscuit") ⇒ buys(x, "orange juice")

where x is a variable and buys(x, y) is a predicate representing the fact that
item y is purchased by person x. This rule indicates that a high percentage of
people who buy biscuits also buy orange juice at the same time, and that
quite many people buy both biscuits and orange juice.

Typically, this method requires the users to specify the minimum support
threshold, which in the above example is the minimum percentage of transactions
containing both biscuits and orange juice required for the rule to be generated.
However, it is difficult for users to set this threshold so as to obtain the results
they want. If the threshold is too small, a very large number of results is
mined, and it is difficult to select the useful information. If the threshold is too
large, there may not be any result at all. Users often have little idea of how
large the threshold should be. Here we study an approach where the user
sets a threshold on the number of results instead of on the support.

We observe that solutions to multiple data mining problems, including mining
association rules [2,4], mining correlations [3], and subspace clustering [5],
are based on the discovery of large itemsets, i.e. itemsets with support greater
than a user-specified threshold. The mining of large itemsets is also the most
difficult part of these methods. Therefore, we would like to mine interesting
itemsets instead of interesting association rules, with a constraint on the
number of large itemsets instead of a minimum support threshold value. The
resulting interesting itemsets are the N-most interesting itemsets of size k for
each k ≥ 1.


2  Definitions 

Similar to [4], we consider a database D with a set of transactions T, and a set
of items I = {i_1, i_2, ..., i_n}. Each transaction is a subset of I, and is assigned a
transaction identifier <TID>.

Definition 1. A k-itemset is a set of items containing k items.

Definition 2. The support of a k-itemset X is the ratio of the number of
transactions containing X to the total number of transactions in D.

Definition 3. The N-most interesting k-itemsets: Let us sort the k-itemsets
by descending support values, and let S be the support of the N-th k-itemset in the
sorted list. The N-most interesting k-itemsets are the set of k-itemsets having
support ≥ S.

Given a bound m on the itemset size, we mine the N-most interesting
k-itemsets from the transaction database D for each k, 1 ≤ k ≤ m.

Definition 4. The N-most interesting itemsets is the union of the N-most
interesting k-itemsets for each 1 ≤ k ≤ m. That is, N-most interesting itemsets
= N-most interesting 1-itemsets ∪ N-most interesting 2-itemsets ∪ ... ∪ N-most
interesting m-itemsets. We say that an itemset in the N-most interesting itemsets
is interesting.
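
Definitions 2 and 3 can be illustrated with a brute-force sketch that enumerates every k-itemset. It is meant only to make the definitions concrete, not to reproduce the algorithms proposed later; the helper names `support` and `n_most_interesting` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def n_most_interesting(transactions, n, k):
    """All k-itemsets whose support is >= that of the n-th best k-itemset.

    Returns (support, itemset) pairs in descending order of support.
    Ties with the n-th itemset are kept, so more than n may be returned.
    """
    items = sorted({i for t in transactions for i in t})
    scored = [(support(c, transactions), c) for c in combinations(items, k)]
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    if len(scored) <= n:
        return scored
    threshold = scored[n - 1][0]  # S in Definition 3
    return [sc for sc in scored if sc[0] >= threshold]


transactions = [
    {"biscuit", "juice"},
    {"biscuit", "juice", "milk"},
    {"biscuit", "milk"},
    {"juice"},
]
top_pairs = n_most_interesting(transactions, n=1, k=2)
top_singles = n_most_interesting(transactions, n=1, k=1)
```

With N = 1 the call for k = 2 returns two itemsets, because {biscuit, juice} and {biscuit, milk} tie at support 0.5; this is the "support ≥ S" clause of Definition 3 at work.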

Definition 5. A potential k-itemset is a k-itemset that can potentially form
part of an interesting (k+1)-itemset.

Definition 6. A candidate k-itemset is a k-itemset that potentially has
sufficient support to be interesting and is generated by joining two potential
(k - 1)-itemsets.

A potential k-itemset is typically generated by grouping itemsets with
support greater than a certain value. A candidate k-itemset is generated as in the
apriori-gen function.
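
For readers unfamiliar with apriori-gen, the join and prune steps of that function, as described in [4], can be sketched as below. This is an illustrative version that assumes itemsets are stored as sorted tuples; the name `apriori_gen` is chosen for this example, and the support counting that the paper's gen_candidate also performs is left out.

```python
from itertools import combinations


def apriori_gen(potential):
    """Generate candidate k-itemsets from potential (k-1)-itemsets.

    `potential` is an iterable of sorted tuples of equal length.
    """
    prev = set(potential)
    candidates = set()
    for a in prev:
        for b in prev:
            # Join step: a and b agree on all but their last item.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be potential.
                if all(sub in prev
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return sorted(candidates)


three_itemsets = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
four_candidates = apriori_gen(three_itemsets)
```

In this example the join produces (1, 2, 3, 4) and (1, 3, 4, 5); the second is pruned because its subset (1, 4, 5) is not among the potential 3-itemsets.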


3  Algorithms 

In this section, we propose two new algorithms, Itemset-Loop and
Itemset-iLoop, for mining N-most interesting itemsets. Both algorithms
have a flavor of the Apriori algorithm [4] but involve backtracking to avoid
missing any itemset. The basic idea is that we automatically adjust the support
thresholds at each iteration according to the required number of itemsets. The
notation used for the algorithms is listed below.


P_k             Set of potential k-itemsets, sorted in descending order of the support values.
support_k       The minimum support value of the N-th k-itemset in P_k.
lastsupport_k   The support value of the last k-itemset in P_k.
C_k             Set of candidate k-itemsets.
I_k             Set of interesting k-itemsets.
I               Set of all interesting itemsets (N-most interesting itemsets).

3.1 Mining N-most Interesting Itemsets with Itemset-Loop

This algorithm has the following inputs and outputs.

Inputs: A database D with the transactions T, the number of interesting
itemsets required (N), the bound on the size of itemsets (m).

Outputs: N-most interesting k-itemsets for 1 ≤ k ≤ m.

Method: In this algorithm, we find some k-itemsets that we call the
potential k-itemsets. The potential k-itemsets include all the N-most interesting
k-itemsets and also extra k-itemsets such that two potential k-itemsets may be
joined to form interesting (k + 1)-itemsets as in the Apriori algorithm.

First, we find the set P_1 of potential 1-itemsets. Suppose we sort all 1-itemsets
in descending order of support. Let S be the support of the N-th 1-itemset in
this ordered list. Then P_1 is the set of 1-itemsets with support greater than or
equal to S. At this point P_1 is the N-most interesting 1-itemsets. The candidate
2-itemsets (C_2) are then generated from the potential 1-itemsets.

The potential 2-itemsets P_2 are generated from the candidate 2-itemsets. P_2 is the
N-most interesting 2-itemsets among the itemsets in C_2. If support_2 is greater
than lastsupport_1, it is unnecessary to loop back. This is the pruning effect.
If support_2 is less than or equal to lastsupport_1, it means that we have not
uncovered all 1-itemsets of sufficient support that may generate a 2-itemset with
support greater than support_2. The system will loop back to find new potential
1-itemsets whose supports are not less than support_2. P_1 is augmented with
these 1-itemsets, and the value of lastsupport_1 is also updated. C_2 is generated
again from P_1. The new potential 1-itemsets may produce candidate
2-itemsets having support ≥ the value of support_2 above. P_2 is generated
again from C_2; it now contains the N-most interesting 2-itemsets from C_2. The
values of support_2 and lastsupport_2 are updated.
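
The loop-back test for 2-itemsets described above can be sketched in a few lines. This is a simplified illustration restricted to k = 2: supports are recomputed by scanning the transactions directly rather than through a hash tree, and when fewer than N candidates exist the support bound simply falls to 0, which stands in for the reduce step of the full algorithm. All function names here are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def top_n(scored, n):
    """Keep itemsets whose support reaches that of the n-th best one.

    Returns (kept, bound); with fewer than n itemsets the bound is 0.
    """
    ranked = sorted(scored.values(), reverse=True)
    bound = ranked[n - 1] if len(ranked) >= n else 0.0
    return {x: s for x, s in scored.items() if s >= bound}, bound


def n_most_interesting_pairs(transactions, n):
    """Return (P_2, looped_back) for the N-most interesting 2-itemsets."""
    items = sorted({i for t in transactions for i in t})
    sup1 = {(i,): support((i,), transactions) for i in items}

    def count_pairs(p1):
        singles = sorted(x[0] for x in p1)
        return {c: support(c, transactions)
                for c in combinations(singles, 2)}

    p1, _ = top_n(sup1, n)
    lastsupport_1 = min(p1.values())
    p2, support_2 = top_n(count_pairs(p1), n)
    looped_back = support_2 <= lastsupport_1
    if looped_back:
        # Augment P_1 with every 1-itemset whose support >= support_2,
        # then regenerate C_2 and P_2.
        p1 = {x: s for x, s in sup1.items() if s >= support_2}
        p2, support_2 = top_n(count_pairs(p1), n)
    return p2, looped_back


data = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}, {"b"}, {"c"},
        {"d", "e"}, {"d", "e"}]
pairs, looped = n_most_interesting_pairs(data, 3)

dense = [{"a", "b", "c", "d", "e"}, {"a", "b", "c", "d"}]
dense_pairs, dense_looped = n_most_interesting_pairs(dense, 5)
```

In the first dataset the three most frequent items are a, b and c, yet the most frequent pair is (d, e); it is found only after looping back. In the second dataset support_2 exceeds lastsupport_1, so no loop-back is needed.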

For mining potential 3-itemsets, the system will find the candidate 3-itemsets
from P_2 with the Apriori-gen algorithm. After finding the potential 3-itemsets,
support_3, and lastsupport_3, it will compare support_3 and lastsupport_1.

Algorithm 1 : Itemset-Loop

var: 1 < k <= m, support_k, lastsupport_k, N, C_k, P_k, D

(P_1, support_1, lastsupport_1) = find_potential_1_itemset(D, N);
C_2 = gen_candidate(P_1);
for (k = 2; k <= m; k++) {
    (P_k, support_k, lastsupport_k) = find_N_potential_k_itemset(C_k, N, k);
    if k < m then C_{k+1} = gen_candidate(P_k); }
I_k = N-most interesting k-itemsets in P_k;
I = ∪_k I_k;
return (I);

find_N_potential_k_itemset(C_k, N, k)
{
    (P_k, support_k, lastsupport_k) = find_potential_k_itemset(C_k, N);
    newsupport = support_k;
    for (i = 2; i <= k; i++) updated_i = FALSE;
    for (i = 1; i <= k; i++) {
        if (i == 1) {
            if (newsupport <= lastsupport_i) {
                (P_i, support_i, lastsupport_i) = find_potential_1_itemsets_with_support(D, newsupport);
                if i < k then C_{i+1} = gen_candidate(P_i);
                if C_{i+1} is updated then updated_{i+1} = TRUE; } }
        else {
            if (newsupport <= lastsupport_i or updated_i == TRUE) {
                (P_i, support_i, lastsupport_i) = find_potential_k_itemsets_with_support(C_i, newsupport);
                if i < k then C_{i+1} = gen_candidate(P_i);
                if C_{i+1} is updated then updated_{i+1} = TRUE; } }
        if (no. of k-itemsets < N and i == k and k == m) {
            newsupport = reduce(newsupport);
            for (j = 2; j <= k; j++) updated_j = FALSE; } }
    return (P_k, support_k, lastsupport_k);
}


Fig.  1.  Itemset-Loop 
Fig. 2. Sketch of the iterations in the step for mining N-most interesting 4-itemsets

- If lastsupport_1 is greater than support_3, it means that there may be some
relevant 1-itemsets missing. P_1 will be augmented by including 1-itemsets
whose supports are ≥ support_3. The value of lastsupport_1 is updated
accordingly. The set C_2 of candidate 2-itemsets will be generated from P_1 again.
After that P_2 is generated from C_2, including all itemsets with support ≥
support_3. lastsupport_2 is updated accordingly.

- If lastsupport_1 is not greater than support_3, support_3 will be compared with
lastsupport_2 of P_2. Similar processing is applied to update P_2, C_3 and P_3.


This  process  is  iterated  with  larger  and  larger  itemsets  and  stops  at  the  user 
specified  bound  m  on  the  itemset  size.  Figure  2  (a)  illustrates  the  idea.  Next  we 
describe  the  functions  used. 

find_potential_1_itemset(D, N): This function finds the N-most interesting
1-itemsets and returns these itemsets as the potential 1-itemsets together
with their supports. The itemsets are sorted in descending order of support
and are placed in P_1. In order to obtain the support values, this function scans
all the transaction records in the database.

The minimum support among the returned itemsets is recorded as support_1
and also as lastsupport_1.

gen_candidate(P_k): This function generates the candidate (k+1)-itemsets
from the potential k-itemsets using the Apriori-gen function [4]. It also scans the
database to count the support of the newly generated candidate itemsets. A
hash tree is used in this process as in [4].

find_N_potential_k_itemset(C_k, N, k): This function finds the N-most
interesting k-itemsets. The system will first compare support_k with lastsupport_1.
If support_k ≤ lastsupport_1, the set of potential 1-itemsets is updated by adding all
1-itemsets with support ≥ support_k. Then the candidate 2-itemsets C_2 will be updated
if necessary. The process is repeated with l-itemsets for 2 ≤ l ≤ k.

find_potential_k_itemset(C_k, N): This function finds potential k-itemsets
from the candidate k-itemsets in C_k. The N-most interesting k-itemsets in C_k
are returned. The values of support_k and lastsupport_k are also returned.

find_potential_1_itemset_with_support(D, newsupport): This function
finds all potential 1-itemsets with support ≥ newsupport. All itemsets with
sufficient support are stored in the set of potential 1-itemsets (P_1). These itemsets
are returned together with lastsupport_1 and support_1.

find_potential_k_itemsets_with_support(C_k, newsupport): This function
finds the potential k-itemsets given the newsupport value and the candidate
k-itemsets. The candidates in C_k are scanned and those having support ≥ newsupport
are returned as P_k; the values of lastsupport_k and support_k
are also updated and returned.

reduce(newsupport): This function reduces the newsupport value for mining
N potential k-itemsets if there are not enough potential k-itemsets.

Correctness: The correctness of the algorithm is based on the downward
closure of large itemsets: if a k-itemset X = {X_1, ..., X_k} is large, then any
(k - 1)-itemset Y ⊂ X must also be large. When we compute the N largest k-itemsets
and discover that the smallest support among them is S, then a (k - 1)-itemset
with support less than S cannot form part of an interesting
k-itemset. Hence if we have considered all the (k - 1)-itemsets with support ≥ S
in the generation of candidate k-itemsets, we have not missed any interesting
k-itemsets. Otherwise, the algorithm loops back to uncover all l-itemsets,
l < k, which have support ≥ S.


3.2  Second  Algorithm  :  Itemset-iLoop 

The first approach requires looping back in the k-th iteration to generate
itemsets of size 1, 2, ..., k - 1 in that order, using a support bound S generated at
the k-itemsets. One alternative is the following: we loop back first to generate
extra (k - 1)-itemsets using S; then, using these extra (k - 1)-itemsets, we may
generate more k-itemsets. With the newly generated k-itemsets, if any, we may
be able to come up with a support bound S' greater than S. With S', we
may need to generate fewer itemsets of size less than k - 1. This process
can be repeated with itemsets of size k - 2, k - 3, ..., 1. Hence we propose a
second algorithm based on this technique. The second algorithm is
similar to the first except that at the k-th iteration, instead of looping
back to the generation of potential 1-itemsets, we loop back first to examine
(k - 1)-itemsets. The algorithm is called Itemset-iLoop. It has the
same inputs and outputs as Algorithm Itemset-Loop.

Method  :  The  functions  in  the  algorithm  are  the  same  as  the  corresponding 
functions  in  Itemset-Loop  algorithm  except  for  the  following: 

find_N_potential_k_itemset(C_k, N, k): This function finds N potential
k-itemsets given the candidate k-itemsets C_k and a new support, support_k. If
support_k > lastsupport_{k-1}, it is not necessary to update P_{k-1}. If support_k ≤
lastsupport_{k-1}, the potential (k - 1)-itemsets (P_{k-1}) will be updated. The missing
(k - 1)-itemsets, which have support greater than or equal to support_k,
will be inserted into P_{k-1}. Then the candidates C_k, and P_k with support_k and
lastsupport_k, will be updated. After this, the system will compare support_k with
lastsupport_{k-2}; the potential (k - 2)-itemsets (P_{k-2}) may be updated in a
similar manner. Then the potential (k - 1)-itemsets, support_{k-1}, lastsupport_{k-1},
the potential k-itemsets, support_k, and lastsupport_k will be updated accordingly.
This is repeated with lastsupport for indices k - 3, k - 4, ..., 1. In each case, we
compare support_k with all lastsupport_i where i < k, and update P_i if necessary.
P_j may be updated at every pass, where j > i, if P_i is updated.

Note that the first two iterations are the same as in Algorithm Itemset-Loop.
Figure 2 (b) is a sketch of the iterations for mining potential 4-itemsets.

Algorithm 2 : Itemset-iLoop

find_N_potential_k_itemset(C_k, N, k)
{
    (P_k, support_k, lastsupport_k) = find_potential_k_itemset(C_k, N);
    newsupport = support_k;
    for (i = k - 1; i >= 1; i = i - 1) {
        if (newsupport <= lastsupport_i) {
            for (j = i; j <= k; j++) {
                if (j == 1) {
                    P_j = find_potential_1_itemset_with_support(D, newsupport); }
                else {
                    P_j = find_potential_k_itemset_with_support(C_j, newsupport); }
                if (j == k) {
                    newsupport = support_k; }
                if (j != k) {
                    C_{j+1} = gen_candidate(P_j); }
            } }
        if (no. of k-itemsets < N and i == 1 and k == m) {
            newsupport = reduce(newsupport);
            i = k - 1; } }
    return (P_k, support_k, lastsupport_k);
}


Fig.  3.  Itemset-iLoop 


4  Experimental  Results 

In this section, we present the performance analysis of the algorithms
Itemset-Loop and Itemset-iLoop and a comparison with the Apriori algorithm [4]. All
experiments were carried out on a SUN ULTRA 5_10 machine running SunOS 5.6.
The workstation has 128MB of memory. The hash-tree data structure [4] is used
for keeping candidate itemsets. Both synthetic and real datasets were
used.

The real data comes from the 1990 United States census. The US census
database is available at the web site of IPUMS-98 (see footnote 1). The experiments
are based on two sets of real data: a small database with 5577 tuples and 77 different
items, and a large database with 57972 tuples and 77 different items. For each
database, we investigate the performance under different values of N in the
N-most interesting itemsets. The values of N are 5, 10, 15, 20, 25, and
30. We mine itemsets up to size 4, hence k-itemsets are mined for 1 ≤ k ≤ 4.
For the function reduce(newsupport) in our proposed algorithms, we choose a
factor of 0.8, meaning that when the function is called, the value of newsupport
is reduced to 0.8 times its original value.
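
As a small illustration of the reduction factor, the sketch below applies a 0.8 reduction repeatedly until enough itemsets reach the bound. The helper `lower_until_enough` and the list of supports are assumptions made for this example; in the proposed algorithms, candidates are regenerated after every reduction rather than read from a fixed list.

```python
def reduce(newsupport, factor=0.8):
    """Lower the support bound; 0.8 is the factor used in the experiments."""
    return newsupport * factor


def lower_until_enough(supports, n, start):
    """Reduce the bound until at least n of the given supports reach it.

    Returns (final bound, number of calls to reduce).
    """
    bound, calls = start, 0
    while sum(1 for s in supports if s >= bound) < n and bound > 0:
        bound = reduce(bound)
        calls += 1
    return bound, calls


# Hypothetical supports of four candidate itemsets.
bound, calls = lower_until_enough([0.5, 0.3, 0.2, 0.1], n=3, start=0.5)
```

Starting from 0.5, the bound passes through roughly 0.4, 0.32, 0.256 and 0.2048 before settling near 0.164, at which point three of the four supports qualify.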

In Figures 4(a) and 4(b), we show the performance of the Itemset-Loop
algorithm, the Itemset-iLoop algorithm, and the Apriori algorithm with different
support thresholds for the small and the large databases respectively. We first
run the algorithms Itemset-Loop and Itemset-iLoop and take the minimum
support thresholds obtained for every N, where N is 5, 10, 15, 20, 25, and 30, after
mining 4-itemsets. We use the notation minsup to represent these thresholds.


1  The  URL  of  IPUMS-98  is  http://www.ipums.umn.edu/. 

Fig. 4. Performance with the growth of the number of N-most interesting itemsets:
(a) small database, (b) large database


For the small database, the thresholds are found to be {0.097, 0.069, 0.062, 0.06,
0.058, 0.054} for N = 5, 10, 15, 20, 25, 30, respectively. For the large database,
the thresholds are found to be {0.22, 0.22, 0.22, 0.14, 0.13, 0.11} (see footnote 2). We apply
the Apriori algorithm with these thresholds to measure the execution time. We
also apply the Apriori algorithm with 0.8, 0.6, 0.4, and 0.2 of these thresholds,
which we call minsup_0.8, minsup_0.6, minsup_0.4, and minsup_0.2, respectively.
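
The scaled thresholds passed to Apriori follow directly from the values above. A short sketch for the large database is given below; the variable names and the rounding to four decimals are choices made for this example only.

```python
# Optimal thresholds reported above for the large database, keyed by N.
optimal = {5: 0.22, 10: 0.22, 15: 0.22, 20: 0.14, 25: 0.13, 30: 0.11}
factors = (0.8, 0.6, 0.4, 0.2)


def scaled_thresholds(optimal, factors):
    """Return {N: {i: minsup_i}} where minsup_i = i * optimal threshold."""
    return {n: {f: round(f * s, 4) for f in factors}
            for n, s in optimal.items()}


table = scaled_thresholds(optimal, factors)
```

For N = 5, for instance, minsup_0.8 is 0.176 and minsup_0.2 is 0.044.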

In general, the performance of the Itemset-Loop algorithm is better than that
of the Itemset-iLoop algorithm. This is because the Itemset-Loop algorithm loops
back to the 1-itemsets first every time and updates the k-itemsets for k > 1 if
necessary. On the other hand, the Itemset-iLoop algorithm loops back to check
(k-1)-itemsets first and does comparisons. Then it loops back to check
(k-2)-itemsets and updates (k-1)-itemsets and k-itemsets if necessary, and so on.
It may involve more backtracking than the Itemset-Loop algorithm. The Apriori
algorithm can provide the optimum results if the user knows the exact maximum
support threshold that can generate the N-most interesting results. We refer
to this threshold as the optimal threshold. Otherwise, the proposed algorithms
perform better.

We have studied the execution time of every pass of the Itemset-Loop
and the Itemset-iLoop algorithms. Since at the first step we only record N, or
slightly more than N, itemsets for every k, it may be necessary to
loop back to update the result in both proposed algorithms. In general an
increase of N leads to an increase in execution time. However, sometimes less
looping back is necessary for a greater value of N, and a decrease in execution
time is recorded.

Table 1 shows the total number of unwanted itemsets generated by the Apriori
algorithm in the large database when the guess of the thresholds is not
optimal. The thresholds minsup_i, where i = 0.8, 0.6, 0.4 and 0.2, are used;

2  Notice  that  the  optimal  thresholds  can  vary  by  orders  of  magnitude  from  case  to 
case,  and  it  is  very  difficult  to  guess  the  optimal  thresholds. 

N     minsup_0.8   minsup_0.6   minsup_0.4   minsup_0.2
5        251          395          583         1236
10       255          379          567         1220
15       105          250          437          947
20       372          541          921         1624
25       374          587          957         1854
30       467          800         1016         2322

Table 1. Number of unwanted itemsets generated by Apriori (large database)


minsup_i is i times the optimal minimum support threshold. We can see that
the amount of unwanted information can increase dramatically with the deviation from
the optimal thresholds.

We have also carried out another set of experiments on synthetic data. The
results are similar in that the proposed method is highly effective and can
outperform the original method by a large margin if the guess of the minimum support
threshold is not good. In the interest of space the details are not shown here.


5  Conclusion 

We proposed two algorithms for the problem of mining N-most interesting
k-itemsets. We carried out a number of experiments to illustrate the performance
of the proposed techniques. We show that the proposed methods do not introduce
much overhead compared to the original method even with the optimal guess of
the support threshold. For thresholds that deviate from the optimum by even a small
factor, the proposed methods have much superior performance in both efficiency and the
generation of useful results.


Abstract. Metadata represent the vehicle by which digital documents can be
efficiently indexed and retrieved. The need for this kind of information is
particularly evident in multimedia digital libraries, which store documents
dealing with different types of media (text, images, sound, video). In this
context, a relevant metadata function consists in superimposing some sort of
conceptual organization over the unstructured information space proper to these
digital repositories, in order to facilitate the intelligent retrieval of the original
documents. To this purpose, the usage of conceptual annotations seems quite
promising. In this paper, we propose a two-step annotation approach by which
conceptual annotations, represented in NKRL [7], [8], are associated with
multimedia documents and used during retrieval operations. We then discuss
how documents and metadata can be stored and managed on persistent storage.


Introduction 

It  is  today  well  recognized  that  an  effective  retrieval  of  information,  from  large  bodies 
of  multimedia  documents  contained  in  current  digital  libraries,  requires,  among  other 
things,  a  characterization  of  such  documents  in  terms  of  some  metadata.  A  relevant 
metadata  function  consists  in  superimposing  some  sort  of  conceptual  organization 
over  the  unstructured  information  space  often  typical  of  digital  libraries,  in  order  to 
facilitate  the  intelligent  retrieval  of  the  original  documents.  Querying  or  retrieving 
various  types  of  digital  media  is  executed  directly  at  the  metadata  level. 

Among the classes of metadata proposed in the scientific literature, only
content-specific metadata "reflect the semantics of the media object in a given context" and
provide a sufficient degree of generality [1]. Unfortunately, as is well known, true
access by semantic content is particularly difficult to achieve, especially for
non-textual material (images, video, audio). In those cases, content-based access is often
supported by the use of simple keywords, or of features mainly related to the
physical structure of multimedia documents (such as colour, shape, texture, etc.) [4].
In order to overcome the limitations of such approaches, conceptual annotations have
been introduced for describing in some depth the context of digital objects [2], [3],
[6]. However, the current approaches, often based on the use of simple ontologies in a
description logic style, have several limitations in terms of description of complex
semantic contents (e.g., of complex events).

To alleviate these problems, we propose a different approach for building up
conceptual annotations to be used for indexing documents stored in a thematic
multimedia library. By thematic multimedia library we mean a library storing
documents concerning a given application domain. Our approach is based on a
two-step annotation process:

•  During the first step, any interesting multimedia document is annotated with a
simple Natural Language (NL) caption in the form of a short text, representing a
general, neutral description of the content of the document. In the case of textual
documents, the interesting parts of the text, or the text itself, could represent the
NL caption. This approach corresponds to the typical process of annotating a paper
document by underlining the interesting parts or writing down remarks and
personal opinions. In the case of other media documents, the NL caption may
represent the semantic content of the document and additional observations
associated with it.

•  During the second step, annotations represented by NL captions are
(semi-automatically) converted into the final conceptual annotations. We propose to
represent the final conceptual annotations in NKRL (Narrative Knowledge
Representation Language) [7], [8]. In NKRL, the metaknowledge associated with a
document consists not only of a set of concepts and instances of concepts
(individuals) but also of a structured set of more complex structures (occurrences)
obtained through the instantiation of general classes of events called templates.
This approach is currently being tested in the context of a European project,
CONCERTO [9].

Note that the use of a two-step annotation process guarantees a high level of
flexibility in querying. First of all, this approach provides a general solution for
mixed media access. This means that a single metadata query can retrieve information
from data that pertain to different media, since the same mechanism is used to
represent their content. Moreover, the first annotation step is quite useful in
supporting similarity-based indexing. Indeed, by associating similar captions with
different documents we make them "similar" from the point of view of the content
and therefore of the retrieval.

In  designing  an  architecture  supporting  the  approach  described  above,  the 
component  dealing  with  the  storage  and  the  management  of  all  the  types  of 
knowledge  (documents,  templates,  concepts,  and  conceptual  annotations)  on 
secondary  storage  plays  a  fundamental  role,  since  its  implementation  strongly 
influences the performance of the overall system. The aim of this paper is to
present a proposal for designing and implementing such a component, which we call the
Knowledge Manager. For this task, we have followed a Web-based approach. In
particular,  the  Knowledge  Manager  has  been  implemented  as  a  true  server  manager 
that  can  be  hosted  on  a  generic  machine  connected  over  Intranet/Internet  networks  to 
the  clients  requiring  such  services.  The  advantage  of  this  approach  is  that  the  software 
component  we  have  designed  can  be  easily  used  by  other  architectures,  based  on  the 
use  of  NKRL  or  similar  languages  for  encoding  conceptual  annotations. 

The paper is organized as follows. Section 2 introduces NKRL, whereas Section 3
introduces an approach for the internal representation of such a language. The
Knowledge Manager architecture is then presented in Section 4. Finally, Section 5
presents some concluding remarks.


NKRL  as  a  Metalanguage  for  Document  Annotations 

In  the  following,  we  briefly  review  the  basic  characteristics  of  NKRL  ( Narrative 
Knowledge  Representation  Language)  (we  refer  the  reader  to  [7],  [8],  [10]  for 
additional  details). 

The  core  of  NKRL  consists  of  a  set  of  general  representation  tools  that  are 
structured  into  four  integrated  components,  described  in  the  following. 

Definitional and enumerative components. The definitional component of NKRL
supplies the tools for representing the important notions (concepts) of a given domain.
In NKRL, a concept is, therefore, a definitional data structure associated with a
symbolic label like human_being, city_, etc. Concepts (definitional
component) and individuals (enumerative component) are represented essentially as
frame-based structures. All NKRL concepts are inserted into a
generalization/specialization hierarchy that corresponds to the usual ontology of terms
and is called H_CLASS(es).

The enumerative component of NKRL concerns the formal representation of the
instances (individuals) (lucy_, wardrobe_23) of the concepts of H_CLASS.
Throughout this paper, we will use the italic type style to represent a concept_ and
the roman style to represent an individual_.

Descriptive  and  factual  components.  The  dynamic  processes  describing  the 
interactions  among  the  concepts  and  individuals  in  a  given  domain  are  represented  by 
making  use  of  the  descriptive  and  factual  components.  The  descriptive  component 
concerns  the  tools  used  to  produce  the  formal  representations  (predicative  templates 
or  simply  templates)  of  general  classes  of  narrative  events,  like  ‘moving  a  generic 
object’,  ‘formulate  a  need’,  ‘be  present  somewhere’.  In  contrast  to  the  binary  structure 
used  for  concepts  and  individuals,  templates  are  characterized  by  a  threefold  format 
where  the  central  piece  is  a  predicate,  i.e.,  a  named  relation  that  exists  among  one  or 
more  arguments  introduced  by  means  of  roles.  The  general  format  of  a  predicative 
template  is  therefore  the  following: 

(P_i (R_1 a_1) (R_2 a_2) ... (R_n a_n))

In the previous expression, P_i denotes the symbolic label identifying the predicative
template, R_k, k = 1, ..., n, denote generic roles, and a_k, k = 1, ..., n, denote the role
arguments. The predicates pertain to the set BEHAVE, EXIST, EXPERIENCE,
MOVE, OWN, PRODUCE, RECEIVE, and the roles to the set SUBJ(ect), OBJ(ect),
SOURCE, BEN(e)F(iciary), MODAL(ity), TOPIC, CONTEXT. Templates are
structured into an inheritance hierarchy, H_TEMP(lates), which corresponds to a
taxonomy (ontology) of events. The instances (predicative occurrences) of the
predicative templates, i.e., the representation of single, specific events like
"Tomorrow, I will move the wardrobe" or "Lucy was looking for a taxi", are in the
domain of the factual component.

Example 1. The NKRL sentence presented in Figure 1 encodes a piece of information like
"On April 5th, 1982, Gordon Pym is appointed Foreign Secretary by Margaret
Thatcher", which can be found directly in a textual document contained in a historical
digital library. The subject of this event is Gordon Pym, represented as a
particular instance (gordon_pym) of the concept individual_person. The
object of this event is the position Gordon Pym is appointed to, represented by the
concept foreign_secretary_pos. Finally, the source of this event is Margaret
Thatcher (represented by the instance margaret_thatcher), since she is
responsible for the event. In the predicative occurrence, temporal information is
represented through two temporal attributes, date-1 and date-2. They define the
time interval in which the meaning represented by the predicative occurrence holds.
In c1, this interval is reduced to a point on the time axis, as indicated by the single
value, the timestamp 5-april-82, associated with the temporal attribute date-1;
this point represents the beginning of an event because of the presence of begin (a
temporal modulator).

c1)  OWN   SUBJ    gordon_pym
           OBJ     foreign_secretary_pos
           SOURCE  margaret_thatcher
           [begin]
           date-1: (5-april-82)
           date-2:

Fig.  1.  Annotation  of  a  WWW  textual  document 

In  the  previous  example,  the  arguments  associated  with  roles  are  simple.  However, 
NKRL  also  provides  a  specialized  sublanguage,  AECS,  supporting  the  construction  of 
structured  arguments  by  using  four  operators:  the  disjunctive  (ALTERNative) 
operator,  the  distributive  (ENUMerative)  operator,  the  collective  (COORDination) 
operator,  and  the  attributive  (SPECIFication)  operator. 

Predicative  occurrences  can  also  be  combined  together,  through  the  use  of  specific 
second  order  structures,  called  binding  occurrences.  Each  binding  occurrence  is 
composed  of  a  binding  operator  and  a  list  of  predicative  or  binding  occurrences, 
representing  its  arguments.  Each  document  (NL  caption,  in  the  considered 
framework)  is  then  associated  with  a  single  conceptual  annotation,  corresponding  to 
the  binding  occurrence  representing  its  semantic  content. 

In  order  to  query  NKRL  occurrences,  search  patterns  have  to  be  used.  Search 
patterns  are  NKRL  data  structures  representing  the  general  framework  of  information 
to  be  searched  for,  within  the  overall  set  of  conceptual  annotations.  A  search  pattern  is 
a  data  structure  including,  at  least,  a  predicate,  a  predicative  role  with  its  associated 
argument,  where  it  is  possible  to  make  use  of  explicit  variables,  and,  possibly,  the 
indication  of  the  temporal  interval  where  the  unification  holds.  As  an  example,  the 
conceptual  annotation  in  Figure  1  can  be  successfully  unified  with  a  search  pattern 
like: "When was Gordon Pym appointed Foreign Secretary?", presented in Figure 2.
The  variable  ?x  means  that  we  want  to  know  the  instant  when  the  event  happened. 

We  refer  the  reader  to  [7],  [8]  for  additional  details  on  these  topics. 
(?w IS-PRED-OCCURRENCE
    :predicate OWN
    :SUBJ gordon_pym
    :date-1 ?x)

Fig.  2.  A  simple  example  of  an  NKRL  search  pattern 
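
The unification of a search pattern with a predicative occurrence can be illustrated with a minimal Python sketch. The dictionary encoding and the `unify` helper are assumptions made for illustration only; they are not part of NKRL, and the sketch ignores the H_CLASS subsumption and temporal-interval reasoning that a full implementation would need.

```python
# Hypothetical in-memory encoding of the occurrence c1 of Figure 1.
occurrence = {
    "predicate": "OWN",
    "SUBJ": "gordon_pym",
    "OBJ": "foreign_secretary_pos",
    "SOURCE": "margaret_thatcher",
    "date-1": "5-april-82",
}

# The search pattern of Figure 2; strings starting with "?" are variables.
pattern = {"predicate": "OWN", "SUBJ": "gordon_pym", "date-1": "?x"}


def unify(pattern, occurrence):
    """Return the variable bindings if the pattern matches, otherwise None."""
    bindings = {}
    for slot, wanted in pattern.items():
        if slot not in occurrence:
            return None
        actual = occurrence[slot]
        if wanted.startswith("?"):
            # A variable must bind to the same value everywhere it occurs.
            if bindings.setdefault(wanted, actual) != actual:
                return None
        elif wanted != actual:
            return None
    return bindings


print(unify(pattern, occurrence))
```

Under these assumptions the pattern binds `?x` to the timestamp stored in `date-1`, which is the answer to the question about when the appointment took place.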


A  Representation  Language  for  NKRL 

Until recently, the usual way of implementing NKRL has been a three-layered
approach: Common Lisp + a frame/object-oriented environment (e.g.,
CRL, Carnegie Representation Language, in the NOMOS project) + NKRL. In order
to ensure a high level of standardization, we are now developing, in the context of the
CONCERTO project [9], a new version of NKRL, implemented in Java and
RDF-compliant (RDF = Resource Description Framework) [5].

RDF is a proposal for defining and processing WWW metadata that is being developed
by a specific W3C Working Group (W3C = World Wide Web Consortium). The
model, implemented in XML (eXtensible Markup Language), makes use of Directed
Labeled Graphs (DLGs) where the nodes, which represent any possible Web resource
(documents, parts of documents, collections of documents, etc.), are described
basically by using attributes that give the named properties of the resources. No
predefined 'vocabulary' (ontologies, keywords, etc.) is in itself a part of the proposal.
The values of the attributes may be text strings, numbers, or other resources. In the
latest versions of the RDF Model and Syntax Specification, new, very interesting
constructs have been added [5]. Among them, of particular interest are the
'containers', i.e., tools for describing collections of resources. In an NKRL context,
the containers are used to represent the structured arguments created by making use of
the operators of the AECS sublanguage (see Section 2).

A first, general problem to be solved in setting up an RDF-compliant version of NKRL
concerned the very different nature of the RDF and NKRL data structures. The
former are 'binary', i.e., based on the usual organization into 'attribute - value'
pairs. The latter are 'tripartite', i.e., organized around a 'predicate', whose
'arguments' are introduced through a third, functional element, the 'role'. To provide
the conversion into RDF format, the NKRL data structures have been represented as
intertwined binary 'tables' that describe the RDF-compliant, general structure of an
NKRL template.
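
One way to picture the conversion from the tripartite format to binary pairs is the following Python sketch. It is only an illustration under stated assumptions: the `to_binary` function and the `occ#role` naming of intermediate nodes are hypothetical, chosen so that each role becomes a resource of its own carrying a `filler` attribute.

```python
def to_binary(occ_id, predicate, roles):
    """Flatten predicate/role/argument into (resource, attribute, value) triples."""
    triples = [(occ_id, "predicateName", predicate)]
    for role, filler in roles.items():
        # Each role gets an intermediate node, so that both the role and its
        # filler can be stated as plain attribute-value pairs.
        role_node = f"{occ_id}#{role}"  # hypothetical naming scheme
        triples.append((occ_id, role, role_node))
        triples.append((role_node, "filler", filler))
    return triples


triples = to_binary(
    "occ11824",
    "OWN",
    {
        "subject": "gordon_pym",
        "object": "foreign_secretary_pos",
        "source": "margaret_thatcher",
    },
)
for triple in triples:
    print(triple)
```

The intermediate role nodes are what makes the tables 'intertwined': the value of one pair is the resource described by another.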

Example 3. Consider the predicative occurrence presented in Figure 1. The
RDF/XML description of c1 is presented in Figure 3. In general, the RDF text
associated with each predicative occurrence is composed of several tags, all nested
inside the <CONCEPTUAL_ANNOTATION> tag and belonging to two different
namespaces: rdf and ca. The first namespace describes the standard environment
under which RDF tags are interpreted. The second namespace describes specific tags
defined in the context of our application. More precisely, the tag
<ca:Template_i> is used to specify that the predicative occurrence is an instance
of the template identified by Template_i. The identifier of the occurrence is an
attribute of this tag (occ11824 in our example). The other tags specify the various
roles of the predicative occurrence, together with the associated arguments.
Additional tags are used to represent temporal information and modulators.
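
As an illustration of how such a description might be read back, the sketch below parses an abridged copy of the Figure 3 text with Python's standard `xml.etree.ElementTree` module. The `read_occurrence` helper and the returned dictionary layout are assumptions made for this example; the DOCTYPE line and several tags are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Abridged RDF text of occurrence occ11824 (DOCTYPE and some tags omitted).
RDF_TEXT = """<CONCEPTUAL_ANNOTATION>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ca="http://projects.pira.co.uk/concerto#">
  <rdf:Description about="occ11824">
    <ca:predicateName>Own</ca:predicateName>
    <ca:subject rdf:parseType="Resource">
      <ca:filler>gordon_pym</ca:filler>
    </ca:subject>
    <ca:source rdf:parseType="Resource">
      <ca:filler>margaret_thatcher</ca:filler>
    </ca:source>
  </rdf:Description>
</rdf:RDF>
</CONCEPTUAL_ANNOTATION>"""

# Prefix-to-URI map used to resolve the rdf: and ca: prefixes in queries.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ca": "http://projects.pira.co.uk/concerto#",
}


def read_occurrence(xml_text):
    """Extract the identifier, predicate and role fillers of one occurrence."""
    root = ET.fromstring(xml_text)
    desc = root.find("rdf:RDF/rdf:Description", NS)
    occ = {
        "id": desc.get("about"),
        "predicate": desc.findtext("ca:predicateName", namespaces=NS),
    }
    for role in ("subject", "object", "source"):
        filler = desc.findtext(f"ca:{role}/ca:filler", namespaces=NS)
        if filler is not None:  # roles absent from the text are skipped
            occ[role] = filler
    return occ


print(read_occurrence(RDF_TEXT))
```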
<?xml version="1.0"?>
<!DOCTYPE DOCUMENTS SYSTEM "CA_RDF.dtd">
<CONCEPTUAL_ANNOTATION>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ca="http://projects.pira.co.uk/concerto#">
    <rdf:Description about="occ11824">
      <rdf:type resource="ca:Occurrence"/>
      <ca:instanceOf>Template43</ca:instanceOf>
      <ca:predicateName>Own</ca:predicateName>
      <ca:subject rdf:ID="Subj43" rdf:parseType="Resource">
        <ca:filler>gordon_pym</ca:filler>
      </ca:subject>
      <ca:object rdf:ID="Obj43" rdf:parseType="Resource">
        <ca:filler>foreign_secretary_pos</ca:filler>
      </ca:object>
      <ca:source rdf:ID="Source43" rdf:parseType="Resource">
        <ca:filler>margaret_thatcher</ca:filler>
      </ca:source>
      <ca:listOfModulators>
        <rdf:Seq><rdf:li>begin</rdf:li></rdf:Seq>
      </ca:listOfModulators>
      <ca:date1>05/04/1982</ca:date1>
    </rdf:Description>
  </rdf:RDF>
</CONCEPTUAL_ANNOTATION>

Fig. 3. The RDF format of a predicative occurrence


The Knowledge Manager Architecture

Four main modules compose the architecture supporting our approach:

•  Acquisition  module,  providing  a  user-friendly  interface  by  which  the  user  can  insert 
documents  and  associate  with  them  some  short  NL  captions. 

•  Annotation module, in charge of translating the NL captions into the
NKRL format.

•  Knowledge  Manager  module,  implementing  the  basic  features  for  storing  and 
managing  NKRL  concepts,  templates,  original  documents,  and  the  associated 
conceptual  annotations  on  persistent  storage. 

•  Query  module,  applying  sophisticated  mechanisms  to  retrieve  all  documents 
satisfying  certain  user  criteria,  by  using  conceptual  annotations. 

In the context of the proposed architecture, the Knowledge Manager plays a
fundamental role. Indeed, since it manages the repositories on secondary storage, its
implementation strongly influences the performance of the overall system. In the
current architecture, the Knowledge Manager has been implemented as a server,
following a Web-based approach, by using Internet-derived technologies for the
communication protocol and metadata representation. In particular, the Knowledge
Manager is organized according to a three-tier architecture, represented in Figure 4.
The first level corresponds to the repository management on persistent storage,
through the use of a specific database management system (IBM DB2 in our case);
the second level is an application level, providing an easy programming interface
(through a Java API) to the repository. Finally, the third level consists of a specific
interface language (called KMIL) that provides access to the Knowledge Manager
through a Web-based approach. In the following, the repositories and their
management, as well as the communication protocol, are described in more detail.
[Figure: three stacked tiers labeled, from top to bottom, Level 3, Level 2, and Level 1]

Fig.  4.  General  architecture  of  the  Knowledge  Manager 


The  Repositories  and  Their  Management 

In  order  to  deal  with  NKRL  data  structures,  we  designed  three  distinct  but  interrelated 
repositories.  The  first  repository  is  the  Document  Repository,  storing  the  original 
documents,  together  with  the  corresponding  NL  captions.  In  order  to  deal  with 
conceptual  annotations,  the  H_TEMP  and  H_CLASS  ontologies  are  stored  in  the 
Ontology  Repository.  The  concrete  conceptual  annotations,  generated  by  the 
Annotation  Module,  are  then  stored  in  the  Conceptual  Annotation  Repository. 

The Conceptual Annotation Repository is certainly the most critical one, since user
queries are executed against it. It contains two main types of data: predicative
occurrences and binding occurrences. Each predicative occurrence is characterized,
among other things, by its XML/RDF text and the identifier of the template it is an
instance of. For each template, we also maintain the set of predicative occurrences
representing the leaves of the subtree rooted at it in H_TEMP. The use of this
information optimizes query processing, since a search pattern always selects a set of
predicative occurrences that are instances of a single template. Each binding
occurrence is internally characterized, among other things, by the binding operator and
the identifiers of its arguments (i.e., binding or predicative occurrences).
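
The per-template set of occurrences can be pictured as an index keyed by template identifier. The following Python sketch is a hedged illustration only: the `OccurrenceIndex` class, the `PARENT` fragment of H_TEMP, and the template names other than Template43 are hypothetical, and the actual system keeps this information in DB2 rather than in memory.

```python
from collections import defaultdict

# Hypothetical fragment of H_TEMP: each template maps to its parent template.
PARENT = {"Template43": "Own", "Template44": "Own", "Own": None}


class OccurrenceIndex:
    """Maps each template to the occurrences found in the subtree rooted at it."""

    def __init__(self, parent):
        self.parent = parent
        self.by_template = defaultdict(set)

    def add(self, occ_id, template):
        # Register the occurrence under its own template and every ancestor.
        node = template
        while node is not None:
            self.by_template[node].add(occ_id)
            node = self.parent.get(node)

    def candidates(self, template):
        # A search pattern names one template; only these need to be unified.
        return self.by_template.get(template, set())


index = OccurrenceIndex(PARENT)
index.add("occ11824", "Template43")
index.add("occ20001", "Template44")  # hypothetical second occurrence
print(sorted(index.candidates("Own")))
```

With such an index, evaluating a search pattern starts from a precomputed candidate set instead of a scan of the whole repository.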

Each document is then associated with a single conceptual annotation, arbitrarily
complex, describing atomic information, through the use of predicative occurrences,
and combined information, through the use of binding occurrences. The repository
maintains the relationship between documents and the associated conceptual
annotations. It is important to note that, to guarantee a high level of flexibility, we
assume that each occurrence can be associated with different documents. This
corresponds to the situation in which different documents refer to similar or identical
events or contain similar or identical images or sound.

Since RDF can be implemented by using XML, we chose IBM DB2 Universal Database,
together with the XML Extender recently released by IBM, to store conceptual
annotations and templates. The repositories are then managed through
a Java API that implements specific operations to be executed against the
repositories. Each operation, before execution, is translated into SQL commands
to be executed by DB2. The use of a Java API provides a high level of portability for
the system we have developed. Moreover, since several packages for implementing an
XML parser in Java are currently available, this choice fits well in the overall system
architecture. Among the supported operations, queries against the Conceptual
Annotation Repository make intensive use of the functionalities supported by IBM DB2
and the IBM DB2 XML Extender to retrieve predicative occurrences starting from given
selection conditions.

The  Communication  Protocol  and  the  Interface  Language 

The Knowledge Manager services can be executed in two different modes (see
Figure 4). In a local environment, the Java API operations are directly called and
executed. In a remote environment, communication is performed through the HTTP
protocol. The use of HTTP provides access to the Knowledge Manager
from any software module located at any site on the Internet. In order to guarantee
standard communication between modules, services have to be expressed by means of
an XML document. Such a document has to be constructed according to a specific
XML language, called Knowledge Manager Interface Language (KMIL). KMIL
requests can be sent by using an HTTP POST action to a Knowledge Manager front-end
servlet running under a specific HTTP servlet engine. This solution has the advantage
that the Knowledge Manager can be hosted on a generic machine, becoming largely
independent of the other modules of the architecture. All requests sent to the
Knowledge Manager are then captured by a Web server that activates a specific Java
servlet for the execution of the requested services, through the use of the Java API, on
the underlying DBMS. As a result, an XML document containing the result of the
computation is returned to the calling module.

Example 4. Suppose that the conceptual annotation of Figure 1 has to be inserted into
the Conceptual Annotation Repository. This can be specified by using the KMIL
document presented in Figure 5. This document contains one <KMIL-ACTION> tag for
the document and one for the predicative occurrence that have to be inserted,
together with all the required information. This information is then used to
consistently update the content of the Conceptual Annotation and Document
repositories.

<?xml version="1.0"?>
<!DOCTYPE KMIL-SESSION SYSTEM "Kmilln.dtd">
<KMIL-SESSION>
  <KMIL-ACTION serial_number="1">
    <KMIL-INSERT-Document IdDoc="doc132">
      <TEXT>
        On April 5th, 1982, Gordon Pym is appointed Foreign
        Secretary by Margaret Thatcher
      </TEXT>
    </KMIL-INSERT-Document>
  </KMIL-ACTION>
  <KMIL-ACTION serial_number="2">
    <KMIL-INSERT-PredOcc IdPO="occ11824" Doc="doc132">
      <TEXT> RDF Text </TEXT>
    </KMIL-INSERT-PredOcc>
  </KMIL-ACTION>
</KMIL-SESSION>


Fig.  5.  Example  of  a  KMIL  request 
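
A client module could assemble such a request programmatically. The Python sketch below is only an illustration under stated assumptions: the helper names and the servlet URL are hypothetical, the DOCTYPE line is omitted, and the request object is built but deliberately not sent. The tag and attribute names follow Figure 5.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_kmil_insert(doc_id, caption, occ_id, rdf_text):
    """Build a KMIL session that inserts a document and one predicative occurrence."""
    session = ET.Element("KMIL-SESSION")
    first = ET.SubElement(session, "KMIL-ACTION", {"serial_number": "1"})
    doc = ET.SubElement(first, "KMIL-INSERT-Document", {"IdDoc": doc_id})
    ET.SubElement(doc, "TEXT").text = caption
    second = ET.SubElement(session, "KMIL-ACTION", {"serial_number": "2"})
    occ = ET.SubElement(
        second, "KMIL-INSERT-PredOcc", {"IdPO": occ_id, "Doc": doc_id}
    )
    ET.SubElement(occ, "TEXT").text = rdf_text
    # The XML declaration is prepended by hand for portability.
    return b'<?xml version="1.0"?>\n' + ET.tostring(session, encoding="utf-8")


def make_request(url, body):
    """Wrap the KMIL document in an HTTP POST request (not sent here)."""
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "text/xml"}
    )


body = build_kmil_insert(
    "doc132",
    "On April 5th, 1982, Gordon Pym is appointed Foreign Secretary "
    "by Margaret Thatcher",
    "occ11824",
    "RDF Text",  # placeholder, as in Figure 5
)
# Hypothetical address of the Knowledge Manager front-end servlet.
request = make_request("http://localhost:8080/km", body)
print(request.get_method(), len(body))
```

Passing the request to `urllib.request.urlopen` would perform the actual POST; the reply would then be the XML result document described above.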
Concluding Remarks

In this paper we have presented an approach for indexing and retrieving multimedia
digital documents through the use of conceptual annotations, describing in detail the
component entrusted with the management of documents and conceptual annotations
in secondary storage. The techniques presented in this paper are now being exploited
in the framework of the Esprit project CONCERTO (CONCEptual indexing, querying
and ReTrieval Of digital documents, Esprit 29159) [9]. The aim of this project is to
improve current techniques for indexing, querying and retrieving textual documents,
mainly concerning the socio-economic and the biotechnology contexts. Future work
includes the definition of specialized techniques for storing and indexing conceptual
annotations. In particular, disk placement and caching techniques for conceptual
annotations are currently under investigation in order to improve the performance of
the system.
Abstract. We investigate a logic-based query language for semistructured
data, that is, data having irregular, partial or only implicit structure. A
typical example is the data found on the Web. We present the syntax and
semantics of SemLog, a logic for querying and restructuring semistructured
data, and show how this language can be used to query video data.

Keywords: Intelligent information retrieval, Logic for databases, Integrating
navigation and search, Video retrieval.


1  Introduction 

Semistructured data models are intended to capture data that are not intentionally
structured, that are structured heterogeneously, or that evolve so quickly that the
changes cannot be reflected in the structure. A typical example is the World-Wide
Web with its HTML pages, text files, bibliographies, biological databases, etc. A
semistructured database essentially consists of objects, which are linked to each other by
attributes.

Semistructured data represent a particularly interesting domain for query languages.
Computations over semistructured data can easily become infinite, even when the
underlying alphabet is finite. This is because the use of path expressions (i.e., compositions
of labels) is allowed, so that the number of possible paths over any finite alphabet is
infinite. Query languages for semistructured data have recently been investigated mainly
in the context of algebraic programming [2,4].

In this paper, we explore a different approach to the problem, an approach based on
logic programming instead of algebraic programming. In particular, we develop an
extension of Datalog (a database logic) for manipulating semistructured data. It has both
a clear declarative semantics and an operational semantics. The semantics are based on
fixpoint theory, as in classical logic programming [9]. The language of terms uses five
countable, disjoint sets: a set of atomic values ($\mathcal{D}_1$), a set of objects
($\mathcal{D}_2$), a set of labels ($\mathcal{D}_3$), a set of object variables
($\mathcal{V}_o$), and a set of path variables ($\mathcal{V}_p$). A path variable is
a variable ranging over paths. The universe of paths over $\mathcal{D}_3$ is infinite. Thus, to keep
the semantics of programs finite, we do not evaluate rules over the entire universe,
$\mathcal{D}_3$, but on a specific active domain. We define the active domain of a database to
be the set of constants (objects and labels) occurring in the database. We then define
the extended active domain to include both the constants in the active domain and all
path expressions resulting from the composition of labels in the active domain. The
semantics of our language is defined with respect to the extended active domain. In
particular, substitutions range over this domain when rules are evaluated.

The extended active domain is not fixed during query evaluation. Instead, whenever a
new path expression is created, the new path and all paths resulting from its
concatenation with already existing paths are added to the extended active domain.

Paper outline: In Section 2, we introduce the data model. In Section 3, we develop
our language and give its syntax and semantics. Section 4 provides an application
example. We conclude in Section 5 by outlining the necessary extensions.


2  Data  Model 

Recent research proposes modeling semistructured data using "lightweight" data
models based on labeled directed graphs [4,1]. Informally, the vertices in such graphs
represent objects and the labels on the edges convey semantic information about the
relationship between objects. The vertices without outgoing edges (sink nodes) in the
graph represent atomic objects and have values associated with them. The other
vertices represent complex objects. An example of a semistructured database in the style
of OEM [13] is given in Figure 1.


[Figure: labeled directed graph rooted at the object org_1]


Fig.  1.  Example  Semistructured  Data 

Path expressions describe paths along the graph, and can be viewed as compositions of
labels. For example, the expression

Residence.City

describes a path that starts in an object, continues to the residence of that object, and
ends in the city of that residence.

In this paper, we assume that the usual base types String, Integer, Real, etc., are
available. In addition, we shall use a new type Feature for labels that would correspond
to attribute names. We write numbers and features literally (the latter usually
capitalized) and use quotation marks for strings, e.g., "car". In what follows we make the
simplifying assumption that labels can be symbols, strings, integers, etc.; in fact, the
type of labels is just the discriminated union of these base types. In addition, we
consider only semistructured data whose graph is acyclic.

Like [11], we represent the graph using two base relations:

—  link(FromObj, ToObj, Label): This relation contains all the edge information. For
instance, link(org_1, e_2, Employee) refers to an edge labeled Employee from the
object org_1 to the object e_2. There may be more than one edge from org_1 to e_2.

—  atomic(Obj, Value): This relation contains all the value information. For instance, the fact
atomic(n_1, "Ullman") states that object n_1 is atomic and has value Ullman.

We  assume  that  each  atomic  object  has  exactly  one  value,  and  each  atomic  object  has 
no  outgoing  edges.  We  consider  that  the  data  comes  in  as  an  instance  over  link  and 
atomic  satisfying  these  two  conditions.  We  use  the  term  database  in  the  following  for 
such  a  data  set. 

Let p be a path expression of the form a_1.a_2. ... .a_n, where each a_i is a label.

Then, link(o_1, o_2, p) holds if there are objects o'_1, ..., o'_{n-1} such that link(o_1, o'_1, a_1),
..., link(o'_{n-1}, o_2, a_n) hold.
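
The two base relations and the path condition above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the tuple encoding, the `holds` function, and the assumption that labels contain no dots are choices made for the example, and the toy facts borrow object names used in this paper.

```python
# Toy database: link holds (from, to, label) edges, atomic maps objects to values.
LINK = {
    ("org_1", "e_1", "Employee"),
    ("org_1", "d_1", "Dept"),
    ("e_1", "n_1", "Name"),
    ("d_1", "e_1", "Head"),
}
ATOMIC = {"n_1": "Ullman"}


def holds(link, src, dst, path):
    """True if link(src, dst, path) holds for a dotted path a_1.a_2. ... .a_n."""
    frontier = {src}
    for label in path.split("."):  # assumes labels themselves contain no dots
        # Follow every edge carrying this label out of the current frontier.
        frontier = {t for (f, t, a) in link if f in frontier and a == label}
    return dst in frontier


print(holds(LINK, "org_1", "n_1", "Employee.Name"))   # True
print(holds(LINK, "org_1", "n_1", "Dept.Head.Name"))  # True
print(holds(LINK, "org_1", "n_1", "Dept.Name"))       # False
```

The set `frontier` plays the role of the intermediate objects in the definition: after processing the k-th label it holds every object reachable from the source by the first k labels.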

3  The  Language 

This  section  develops  the  syntax  and  semantics  of  the  language  for  specifying  programs 
over  semistructured  data. 


3.1  Syntax 

The language of terms uses three countable, pair-wise disjoint sets:

1.  A set $\mathcal{D}$ of constant symbols. This set is the union of three pair-wise disjoint sets:

—  $\mathcal{D}_1$: a set of atomic values
—  $\mathcal{D}_2$: a set of entities, also called object entities
—  $\mathcal{D}_3$: a set of labels

2.  A set $\mathcal{V}_o$ of variables called object variables and value variables, and denoted X, Y, Z, ...

3.  A set $\mathcal{V}_p$ of variables called path variables, and denoted α, β, ...


Definition 1. (Predicate Symbol) We define the following predicate symbols:

—  The predicate symbol link with arity 3

—  The binary predicate symbol atomic

—  The user-specified intensional predicates (ordinary predicates), which can be of any arity

We model semistructured data by a program P which contains, besides the set of facts
built from link and atomic, the following rule:

link(X, Y, α.β) :- link(X, Z, α), link(Z, Y, β)

This rule says that if an object Z can be reached from an object X by a path α and
from Z we can reach an object Y by a path β, then there is a path α.β from X to Y.

Other ordinary facts can be specified by rules of the form:

L(X) :- L_1(Y_1), ..., L_n(Y_n)

for some n > 0, where X, Y_1, ..., Y_n are tuples of variables or constants. We require
that the rules are safe, i.e., a variable that appears in X must also appear in $Y_1 \cup \ldots \cup Y_n$.
The predicates L_1, ..., L_n may be either link or atomic, or ordinary predicates. In the
following, we use the term (positive) atoms to refer to L_1, ..., L_n.

Example 1. Figure 2 shows a fragment of the semistructured data of Figure 1.

atomic(o_1, 25)
atomic(n_2, "Turing")
atomic(l_1, "BatA")

link(org_1, e_1, Employee)
link(org_1, d_1, Dept)
link(e_1, n_1, Name)
link(d_1, n_3, Name)
link(d_1, e_1, Head)

Fig. 2. A set of facts

The intensional part of the program P contains the single rule:

link(X, Y, α.β) :- link(X, Z, α), link(Z, Y, β)

Given the semistructured data of Figure 1, the query:

Answer(X) :- link(org_1, Y, α.Author.Name), atomic(Y, X)

returns all the author names reachable from the root object org_1 by any path having
as prefix any path (here given by the (possible) value of the path variable α), and as
suffix the path Author.Name.
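
A direct, if naive, way to evaluate such a query is to enumerate every path leaving the root and keep those ending in the required suffix. The Python sketch below works under explicit assumptions: the toy facts are hypothetical (the data of Figure 1 is not reproduced here), the graph is acyclic, and α is taken to match at least one label, so a path consisting of Author.Name alone is not selected.

```python
def paths_from(link, root):
    """Yield (target, labels) for every non-empty path from root (acyclic graph)."""
    stack = [(root, ())]
    while stack:
        node, labels = stack.pop()
        for (f, t, a) in link:
            if f == node:
                extended = labels + (a,)
                yield t, extended
                stack.append((t, extended))


def answer(link, atomic, root, suffix):
    """Evaluate Answer(X) :- link(root, Y, alpha.suffix), atomic(Y, X)."""
    result = set()
    for target, labels in paths_from(link, root):
        # alpha is assumed non-empty, so the path must be longer than the suffix.
        long_enough = len(labels) > len(suffix)
        if long_enough and labels[-len(suffix):] == suffix and target in atomic:
            result.add(atomic[target])
    return result


# Hypothetical facts, invented for this example only.
LINK = {
    ("org_1", "p_1", "Publication"),
    ("p_1", "a_1", "Author"),
    ("a_1", "n_1", "Name"),
    ("org_1", "a_2", "Author"),
    ("a_2", "n_2", "Name"),
}
ATOMIC = {"n_1": "Ullman", "n_2": "Turing"}

print(answer(LINK, ATOMIC, "org_1", ("Author", "Name")))
```

On these facts only n_1 is reached through a path of the form α.Author.Name; n_2 is reached by Author.Name with no prefix and is therefore left out under the stated assumption.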


3.2  Semantics 

Our language has a declarative model-theoretic semantics and a fixpoint semantics.


Model-theoretic semantics. Recall that $\mathcal{V}_o$ denotes a set of variables called object
and value variables, and $\mathcal{V}_p$ denotes a set of variables called path variables. Let
$\mathcal{V} = \mathcal{V}_o \cup \mathcal{V}_p$.

Let $var_1$ be a countable function that assigns to each syntactical expression a subset
of $\mathcal{V}_o$ corresponding to the set of object and value variables occurring in the expression,
and $var_2$ be a countable function that assigns to each expression a subset of $\mathcal{V}_p$
corresponding to the set of path variables occurring in the expression.

Let $var = var_1 \cup var_2$. If $E_1, \ldots, E_n$ are syntactic expressions, then $var(E_1, \ldots, E_n)$
is an abbreviation for $var(E_1) \cup \ldots \cup var(E_n)$.

A ground atom A is an atom for which $var(A) = \emptyset$. A ground rule is a rule r for which
$var(r) = \emptyset$.






Definition 2. (Extension) Given the set $\mathcal{D}_3$ of labels, the extension of $\mathcal{D}_3$, written
$\mathcal{D}_3^{ext}$, is the set of path expressions containing the following elements:

—  each element in $\mathcal{D}_3$

—  for each ordered pair $p_1, p_2$ of elements of $\mathcal{D}_3^{ext}$, the element $p_1.p_2$


Definition 3. (Extended Active Domain) The active domain of an interpretation
$\mathcal{I}$, denoted $\mathcal{D}_{\mathcal{I}}$, is the set of elements appearing in $\mathcal{I}$, that is, a subset of
$\mathcal{D}_1 \cup \mathcal{D}_2 \cup \mathcal{D}_3$. The extended active domain of $\mathcal{I}$, denoted $\mathcal{D}_{\mathcal{I}}^{ext}$, is the
extension of $\mathcal{D}_{\mathcal{I}}$, that is, a subset of $\mathcal{D}_1 \cup \mathcal{D}_2 \cup \mathcal{D}_3^{ext}$.


Definition 4. (Interpretation) Given a program P, an interpretation $\mathcal{I}$ of P
consists of:

—  A domain $\mathcal{D}$

—  A mapping from each constant symbol in P to an element of domain $\mathcal{D}$

—  A mapping from each n-ary predicate symbol in P to a relation in $(\mathcal{D}^{ext})^n$


Definition 5. (Valuation) A valuation $v_1$ is a total function from $\mathcal{V}_o$ to the set of
elements $\mathcal{D}_1 \cup \mathcal{D}_2$. A valuation $v_2$ is a total function from $\mathcal{V}_p$ to the set of elements
$\mathcal{D}_3^{ext}$. Let $v = v_1 \cup v_2$. Then $v$ is extended to be the identity on $\mathcal{D}$ and then extended
to map free tuples to tuples in a natural fashion.


Definition 6. (Atom Satisfaction) Let $\mathcal{I}$ be an interpretation. A ground atom L is
satisfiable in $\mathcal{I}$ if L is present in $\mathcal{I}$.


Definition 7. (Rule Satisfaction) Let r be a rule of the form:

$r : A \leftarrow L_1, \ldots, L_n$

where $L_1, \ldots, L_n$ are (positive) atoms. Let $\mathcal{I}$ be an interpretation, and $v$ be a valuation
that maps all variables of r to elements of $\mathcal{D}_{\mathcal{I}}^{ext}$. The rule r is said to be true (or
satisfied) in interpretation $\mathcal{I}$ for valuation $v$ if $v[A]$ is present in $\mathcal{I}$ whenever each
$v[L_i]$, $i \in [1, n]$, is satisfiable in $\mathcal{I}$.


Fixpoint Semantics. The fixpoint semantics is defined in terms of an immediate
consequence operator, $T_P$, that maps interpretations to interpretations. An
interpretation of a program is any subset of all ground atomic formulas built from
predicate symbols in the language and elements in $\mathcal{D}^{ext}$. Each application of the
operator $T_P$ may create new atoms. We show below that $T_P$ is monotonic and
continuous. Hence, it has a least fixpoint that can be computed in a bottom-up
iterative fashion.

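Recall that the language of terms has three countable disjoint sets: a set of atomic
values ($\mathcal{D}_1$), a set of entities ($\mathcal{D}_2$), and a set of labels ($\mathcal{D}_3$). A path expression is an
element of $\mathcal{D}_3^{ext}$. We define $\mathcal{D}^{ext} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \mathcal{D}_3^{ext}$.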

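Lemma 1. If $\mathcal{I}_1$ and $\mathcal{I}_2$ are two interpretations such that $\mathcal{I}_1 \subseteq \mathcal{I}_2$, then $\mathcal{D}_{\mathcal{I}_1}^{ext} \subseteq \mathcal{D}_{\mathcal{I}_2}^{ext}$.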
Definition 8. (Immediate Consequence Operator) Let $P$ be a program and $\mathcal{I}$
an interpretation. A ground atom $A$ is an immediate consequence for $\mathcal{I}$ and $P$ if either
$A \in \mathcal{I}$, or there exists a rule $r : H \leftarrow L_1, \ldots, L_n$ in $P$, and there exists a valuation $v$,
based on $\mathcal{D}_\mathcal{I}^{ext}$, such that:

— $A = v(H)$, and

— $\forall i \in [1,n]$, $v(L_i)$ is satisfiable.

Definition 9. (T-Operator) The operator $T_P$ associated with program $P$ maps
interpretations to interpretations. If $\mathcal{I}$ is an interpretation, then $T_P(\mathcal{I})$ is the following
interpretation:

$T_P(\mathcal{I}) = \mathcal{I} \cup \{A \mid A \text{ is an immediate consequence for } \mathcal{I} \text{ and } P\}$


Theorem 1. (Continuity & monotonicity) The operator $T_P$ is continuous and
monotonic.
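The bottom-up computation of the least fixpoint can be sketched as follows. This is only a minimal illustration under simplifying assumptions, not the implementation behind the language: interpretations are finite sets of ground atoms written as tuples, variables are strings starting with an uppercase letter, and path expressions and the extended domain are ignored. The function names and the `link`/`reach` predicates are hypothetical.

```python
from itertools import product

def is_var(term):
    """Illustrative convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()

def substitute(atom, valuation):
    """Apply a valuation to an atom; constants are mapped to themselves."""
    return (atom[0],) + tuple(valuation.get(t, t) for t in atom[1:])

def immediate_consequences(rules, interp, domain):
    """One application of T_P: interp plus every head whose body atoms are all in interp."""
    result = set(interp)
    for head, body in rules:
        variables = sorted({t for atom in [head, *body] for t in atom[1:] if is_var(t)})
        # Enumerate every valuation of the rule's variables over the finite domain.
        for values in product(domain, repeat=len(variables)):
            valuation = dict(zip(variables, values))
            if all(substitute(atom, valuation) in interp for atom in body):
                result.add(substitute(head, valuation))
    return result

def least_fixpoint(rules, facts):
    """Iterate T_P from the given facts until no new atom is derived."""
    interp = set(facts)
    domain = sorted({t for atom in facts for t in atom[1:]})
    while True:
        nxt = immediate_consequences(rules, interp, domain)
        if nxt == interp:
            return interp
        interp = nxt

# Hypothetical program: reachability over link facts.
facts = {("link", "a", "b"), ("link", "b", "c")}
rules = [
    (("reach", "X", "Y"), [("link", "X", "Y")]),
    (("reach", "X", "Z"), [("link", "X", "Y"), ("reach", "Y", "Z")]),
]
model = least_fixpoint(rules, facts)
```

Because each application only adds atoms and the set of ground atoms over a finite domain is finite, the loop terminates; this mirrors the monotonicity argument in a finite setting only.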

In  the  following,  we  illustrate  the  use  of  our  query  language  for  video  data  retrieval. 


4  An  Example:  Video  Databases 

Digital video is a content-rich information medium of massive proportions. In fact,
the data volume of video is about seven orders of magnitude larger than that of a
structured data record [6]. Video data also carries temporal and spatial information.
Moreover, the structure of video data and the relationships among its components are
very complex and ill-defined. These characteristics make it challenging to manage
video data in a way that provides efficient, content-based user access. One of the main
problems is that defining schema information for video data in advance turns out
to be very difficult [5], and thus a semistructured approach has to be considered.

We  consider  two  layers  for  representing  video  content  (figure  3): 

(1) Feature & Content Layer. It contains video visual features (e.g., color, shape,
motion). This layer is characterized by a set of techniques and algorithms for
retrieving video sequences based on the similarity of visual features.

(2) Semantic Layer. This layer contains objects of interest, their descriptions, and
relationships among objects based on extracted features. Objects in a video sequence
are represented in the semantic layer as visual entities. Instances of visual objects
consist of conventional attributes (e.g., name, actorID, date).


[Figure 3: two stacked layers. Semantic Layer: objects, their descriptions, mutual
relationships. Feature & Content Layer: colors, motions, spatial relationships;
algorithms for similarity of physical features.]

Fig. 3. Two layers for video content

Figure 4 shows a fragment of the semantic content of a video database. Although a
real-world video database would of course be much larger, this example concisely
captures the sort of structure (or lack thereof) needed to illustrate the features
of our language. As illustrated by figure 4, the structure of the content describing a
video differs from one category to another, and even within the same category. Here,
the attribute name Frames links an abstract object to a concrete object name, which
denotes the name of the image sequence stored in the Feature & Content Layer.

It is easy to see that the proposed query language can be used to navigate the Semantic
Layer of this database. In the following, we extend the language to accommodate the
Feature & Content Layer.



Fig.  4.  A  fragment  of  a  video  database  content 


Extension to Feature & Content Layer

Video data in the Feature & Content Layer can be organized in a hierarchy of units with
individual frames at the base level, and higher level segments such as shots, scenes,
and episodes. This model facilitates querying and composition at different levels and
thus enables a very rich set of temporal and spatial operations. Examples of temporal
operations are "follows", "contains", and "transition". Examples of spatial operations
are "parallel to" and "below".
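As a rough illustration of how such a hierarchy supports temporal operations, the sketch below represents each unit as a closed interval of frame indices. The interval-based readings of "contains" and "follows" are assumptions made for this example; the text does not define these operations formally, and the `Segment` class and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """A video unit (shot, scene, episode, ...) as a closed interval of frame indices."""
    name: str
    level: str
    start: int  # index of the first frame
    end: int    # index of the last frame (inclusive)

def contains(outer: Segment, inner: Segment) -> bool:
    """Assumed semantics: inner's frame interval lies within outer's."""
    return outer.start <= inner.start and inner.end <= outer.end

def follows(later: Segment, earlier: Segment) -> bool:
    """Assumed semantics: later starts after earlier has ended."""
    return later.start > earlier.end

scene = Segment("scene1", "scene", 0, 299)
shot_a = Segment("shot1", "shot", 0, 119)
shot_b = Segment("shot2", "shot", 120, 299)
```

With these definitions, `contains(scene, shot_a)` and `follows(shot_b, shot_a)` both hold, so the same predicates can be applied at any level of the hierarchy.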

If we regard a piece of video data as a set of images, then Query-by-Example
methods developed for images (see, for example, [8]) can be used to retrieve video data
by audiovisual content. For example, [12] implemented a system that retrieves video
data from an example of the motion of an object observed in the video. The example
motion is specified by moving the mouse, and a trajectory and velocity are sampled
from that movement.

In  the  following,  by  exploiting  this  notion  of  procedural  attachment  [10],  we  provide  an 
extension  of  our  language,  leading  to  a  rule-based,  constraint  query  language  for  video 
data  retrieval. 
Definition 10. (Rule) A rule in our extended language has the form:

$r : H \leftarrow L_1, \ldots, L_n \;\&\; c_1, \ldots, c_m$

where $H$ is an atom, $n, m \geq 0$, $L_1, \ldots, L_n$ are (positive) literals, and $c_1, \ldots, c_m$ are
constraints.

Definition 11. (Rule Satisfaction) Let $r$ be a rule of the form:

$r : A \leftarrow L_1, \ldots, L_n \;\&\; c_1, \ldots, c_m$

where $L_1, \ldots, L_n$ are (positive) atoms, and $c_1, \ldots, c_m$ are constraints. Let $\mathcal{I}$ be an
interpretation, and $v$ be a valuation that maps all variables of $r$ to elements of $\mathcal{D}^{ext}$.
The rule $r$ is said to be true (or satisfied) in interpretation $\mathcal{I}$ for valuation $v$ if $v[A]$ is
present in $\mathcal{I}$ whenever:

— each $v(c_i)$, $i \in [1,m]$, is satisfiable, and

— each $v[L_i]$, $i \in [1,n]$, is satisfiable in $\mathcal{I}$.

Given the previous database fragment, the query:

$q(Y) \leftarrow \mathit{link}(X, Y, a.\mathit{sequences}),\ \mathit{link}(Y, Z, \mathit{duration}),$

$\mathit{atomic}(Z, Z'),\ \mathit{link}(Y, Z'', \mathit{frames}),\ \mathit{atomic}(Z'', F)\ \&$

$Z' < 20,\ \mathit{similar\text{-}color}(F, fd)$

can be read as "Find the set of sequences with a duration below 20 such that the video
clip of each sequence in the answer set (i.e., the filler of the attribute frames, here F)
is similar in color to the video clip fd". Here fd is the name of a video clip stored in the
Feature & Content Layer, and similar-color is a symbol with an attachment. The attached
program will be executed in the Feature & Content Layer.
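The way such a query combines navigation in the Semantic Layer with an attached procedure can be sketched as follows. Everything concrete here is an assumption made for illustration: the facts are toy data rather than the content of Figure 4, the path expression is simplified to the single label `sequences`, and `similar_color` is a placeholder that compares one hypothetical dominant-color tag per clip instead of running a real feature-matching algorithm.

```python
# Toy Semantic Layer facts (hypothetical): link(source, target, label) and atomic(node, value).
LINKS = {
    ("video1", "seq1", "sequences"), ("video1", "seq2", "sequences"),
    ("seq1", "d1", "duration"), ("seq1", "f1", "frames"),
    ("seq2", "d2", "duration"), ("seq2", "f2", "frames"),
}
ATOMIC = {"d1": 12, "f1": "clipA", "d2": 45, "f2": "clipB"}

# Stand-in for the Feature & Content Layer: one dominant color per stored clip.
DOMINANT_COLOR = {"clipA": "red", "clipB": "red", "fd": "red"}

def similar_color(clip, reference):
    """Attached procedure (placeholder): clips match when their dominant colors are equal."""
    return DOMINANT_COLOR.get(clip) == DOMINANT_COLOR.get(reference)

def targets(source, label):
    """All nodes reached from source by an edge carrying the given label."""
    return [t for (s, t, l) in LINKS if s == source and l == label]

def query(reference="fd", max_duration=20):
    """Sequences Y with a duration below max_duration whose frames resemble reference."""
    answers = set()
    for (_, y, label) in LINKS:
        if label != "sequences":
            continue
        durations = [ATOMIC[z] for z in targets(y, "duration")]
        clips = [ATOMIC[z] for z in targets(y, "frames")]
        # The constraint part of the rule: a comparison plus a call to the attachment.
        if any(d < max_duration for d in durations) and any(
            similar_color(f, reference) for f in clips
        ):
            answers.add(y)
    return answers

result = query()
```

In this toy data only `seq1` satisfies both constraints; `seq2` matches on color but its duration is too long. The point of the design is that the rule body stays declarative while the similarity test is delegated to code living in the other layer.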

5  Conclusion 

There is a growing interest in semistructured data, and this field offers many new
challenges to IR research. As semistructured data proliferate, aids to browsing and
filtering become increasingly important tools for managing such exponentially growing
information resources and for dealing with access problems. We believe that formal
settings will help in understanding the related modeling and querying problems. This
will lead to the development of robust systems that effectively integrate, retrieve, and
correlate semistructured data.

We have presented a logic-based language for querying semistructured data, given its
formal semantics, and applied it to video data. Several interesting directions remain
to be pursued:

—  In  the  language  we  presented,  navigational  queries  are  expressed  using  variables 
ranging  over  path  expressions  in  the  graph  representing  the  data.  An  important 
aspect  to  be  considered  is  the  use  of  path  constraints  [3]  to  take  advantage  of  local 
knowledge  about  the  data  graph. 

—  An  important  and  critical  problem  is  the  discovery  of  the  structure  implicit  in  our 
video  data.  This  is  especially  important,  since  video  data  are  often  accessed  in  an 
explorative  or  browse  mode.  For  that,  it  may  be  useful  to  build  a  layer  of  classes 
on  top  of  our  data  model.  The  classes  can  be  defined  by  rules  and  populated  by 
computing a greatest fixpoint [11].

—  Due  to  the  visual  nature  of  video  data,  a  user  may  be  interested  in  results  that  are 
similar  to  the  query.  Thus,  the  query  system  should  be  able  to  perform  exact  as 
well  as  partial  or  fuzzy  matching.  The  first  investigations  reported  in  [7]  constitute 
a  nice  basis. 




References 

1. Serge Abiteboul. Querying Semi-Structured Data. In Proceedings of the
International Conference on Database Theory (ICDT’97), Delphi, Greece, pages 1-18,
January 1997.

2. Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L.
Wiener. The Lorel Query Language for Semistructured Data. International Journal
on Digital Libraries, 1(1):68-88, 1997.

3. Serge Abiteboul and Victor Vianu. Regular Path Queries with Constraints. In
Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems (PODS’97), Tucson, Arizona, pages 122-133. ACM
Press, May 1997.

4. Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A Query
Language and Optimization Techniques for Unstructured Data. In Proceedings of the
ACM SIGMOD International Conference (SIGMOD’96), Montreal, Canada, pages
505-516, June 1996.

5. Cyril Decleir, Mohand-Said Hacid, and Jacques Kouloumdjian. A Database
Approach for Modeling and Querying Video Data. In Proceedings of the 15th
International Conference on Data Engineering (ICDE’99), Sydney, Australia, March
1999.

6. Ahmed K. Elmagarmid and Haitao Jiang. Video Database System: Issues, Products
and Applications. Kluwer, 1997.

7. Ronald Fagin. Fuzzy Queries in Multimedia Database Systems. In Jan Paredaens,
editor, Proceedings of the 1998 ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems (PODS’98), pages 1-10, Seattle, Washington, USA,
1998. Invited Paper.

8. Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang,
Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, D. Steele,
and P. Yanker. Query by Image and Video Content: The QBIC System. In Marc T.
Maybury, editor, Intelligent Multimedia Information Retrieval, chapter 1, pages
7-22. 1996.

9. John W. Lloyd. Foundations of Logic Programming. Springer-Verlag, 1987. Second
edition.

10. Karen L. Myers. Hybrid Reasoning Using Universal Attachment. Artificial
Intelligence, 67:329-375, 1994.

11. Svetlozar Nestorov, Serge Abiteboul, and Rajeev Motwani. Extracting Schema
from Semistructured Data. In Laura M. Haas and Ashutosh Tiwary, editors,
Proceedings of the ACM SIGMOD International Conference on Management of Data
(SIGMOD’98), pages 295-306, Seattle, Washington, USA, June 1998. ACM Press.

12. A. Yoshitaka, Y. Hosoda, M. Yoshimitsu, M. Hirakawa, and T. Ichikawa.
VIOLONE: Video Retrieval by Motion Example. Journal of Visual Languages and
Computing, 7:423-443, 1996.

13. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object Exchange Across
Heterogeneous Information Sources. In Proceedings of the 11th International
Conference on Data Engineering (ICDE’95), Taipei, Taiwan, pages 251-260, March 1995.


High  Quality  Information  Retrieval  for  Improving 
the  Conduct  and  Management  of 
Research  and  Development 


Ronald  N.  Kostoff  1 

Office  of  Naval  Research,  800  N.  Quincy  St.,  Arlington,  VA  22217 
Internet:  kostofr@onr.navy.mil 


Abstract.  The  purpose  of  the  present  paper  is  to  convey  the  importance  of  high  quality 
information  retrieval  for  maximizing  progress  in  R&D,  and  to  present  generic  protocols 
for  constructing  high  quality  literature  queries.  The  paper  begins  with  an  example  of  the 
information  retrieval  limitations  characteristic  of  present  R&D  practices,  describes 
requirements  for  conducting  high  quality  information  retrieval,  and  presents  a  proposal 
for  expanding  dissemination  and  widening  access  to  high  quality  information  retrieval 
methods.  Retrieval  of  medical  R&D  information  was  selected  as  an  illustrative  example. 


1.  Introduction 

For  the  past  decade,  the  author  has  been  developing  methods  for  extracting  useful 
information from large S&T text databases [1, 2]. These methods have been based
upon  the  latest  information  technology  concepts  and  algorithms,  and  can  offer 
literature  searches  that  are  extremely  comprehensive  with  high  signal-to-noise  ratios. 
As  part  of  a  recent  assessment  of  information  retrieval  techniques  [3],  the  author 
examined  many  biomedical  studies  that  included  literature  searches.  The  Science 
Citation  Index  (SCI)  Abstracts  of  these  studies  contained  the  queries  used  for  the 
literature  surveys.  These  queries  had  the  following  characteristics: 

1)  The  source  data  came  almost  exclusively  from  Medline  alone,  except  for  those 
studies  whose  objective  was  to  survey  the  Web  resources  available  for  the  target 
medical  issue; 

2)  The  focus  of  most  of  the  studies  seemed  to  concentrate  around  narrowly  defined 
medical  problems,  with  little  indication  offered  that  supporting  or  related  medical/ 
technical  areas  were  of  any  interest; 

3)  The  reported  queries  contained  3-6  phrases  on  average; 

4)  The  phrases  were  either  searcher-generated,  or  were  the  indexed  terms  from  the 
Medline MeSH taxonomy. No evidence was presented that an exhaustive search of
author-generated  terms  was  performed. 

1  THE  VIEWS  PRESENTED  IN  THIS  PAPER  ARE  SOLELY  THOSE  OF  THE  AUTHOR  AND  DO 
NOT  REPRESENT  THE  VIEWS  OF  THE  DEPARTMENT  OF  THE  NAVY. 

Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI 1932,  pp.  86-96,  2000. 
Queries with the above characteristics result in a deficient retrieved information base.
These  deficiencies  translate  into  limitations  on  the  credibility  and  quality  of  study 
results  and  subsequent  research  and  development  (R&D),  for  the  following  reasons. 

1)  Searches  that  do  not  access  the  myriad  databases  available,  and  queries  that  do  not 
result  in  comprehensive  retrievals  of  the  information  available  in  the  databases  actually 
searched,  result  in  only  a  fraction  of  the  existing  knowledge  being  available  for  study 
and  R&D  exploitation. 

2)  Searches  and  queries  not  designed  to  a)  access  literatures  directly  supportive  of  the 
target  literature  and  b)  access  literatures  related  to  the  target  literature  by  some 
common  or  intermediary  thread,  will  not  provide  the  insights  and  discoveries  from 
these  other  disciplines  that  often  result  in  innovations  in  the  target  discipline  of 
primary interest [4].

3)  Queries  that  are  severely  restricted  in  length,  that  rely  in  large  measure  on  generic 
indexer-supplied  terms,  and  have  not  been  extensively  iterated  with  the  author- 
supplied  language  in  the  source  database,  will  be  inadequate  in  capturing  the  myriad 
ways  in  which  different  authors  describe  the  same  concept,  and  will  also  yield  many 
records  that  are  non-relevant  to  the  main  technical  themes  of  the  study. 

In  summary,  these  types  of  simple  limited  queries  can  result  in  two  serious  problems:  a 
substantial  amount  of  relevant  literature  is  not  retrieved,  and  a  substantial  amount  of 
non-relevant  literature  is  retrieved.  As  a  result,  the  potential  user  is  either 
overwhelmed  with  extraneous  data,  or  is  uninformed  about  existing  valuable 
information, leading to potential duplication of effort and/or R&D based on
incomplete  use  of  existing  data.  All  the  subsequent  data  processing,  both  human  and 
computerized,  cannot  compensate  for  these  deficiencies  in  the  base  data  quality.  In 
contrast  to  these  typical  biomedical  study  Medline  queries  reported  in  the  SCI 
Abstracts,  the  author’s  group  has  been  developing  information  retrieval  techniques  [1, 
2]  using  an  iterative  relevance  feedback  approach.  The  source  database  queries  result 
in  retrieval  of  very  comprehensive  source  database  records  that  encompass  direct  and 
supporting literatures with very high ratios of desired/undesired records. Some of the
queries  consist  of  hundreds  of  terms  [5,  6].  In  those  specific  cases,  queries  of  this 
magnitude  are  necessary  to  achieve  the  retrieval  comprehensiveness  and  ‘signal-to- 
noise’  ratio  required.  Queries  of  a  specific  size  are  not  a  query  development  target; 
rather,  the  query  development  process  produces  a  query  of  sufficient  magnitude  to 
achieve  the  target  objectives  of  comprehensiveness  and  high  relevance  ratio. 

The  reader  interested  in  more  details  about  the  query  development  protocols  discussed 
above,  as  well  as  the  larger  text  mining  context  in  which  they  are  imbedded,  is 
encouraged  to  contact  the  author.  An  excellent  overview  of  information  retrieval 
techniques  is  contained  in  [7].  Many  detailed  information  retrieval  technique 
descriptions  can  be  found  in  the  TREC  Conferences’  Proceedings  on  the  NIST  Web 
site,  and  the  SIGIR  Conferences’  Proceedings  on  the  ACM  Web  site. 
2. Importance of High Quality Information Retrieval to Support
S&T 

Information retrieval is one component of a larger information extraction and
integration process. To extract useful information from large volumes of
semi-structured and unstructured S&T text, sophisticated text mining (TM) techniques
have been developed [5, 8, 9, 10, 11, 12]. TM could address the following specific
issues that arise repeatedly in the conduct and management of R&D:

What R&D is being done globally; Who is doing it; What is the level of effort;
Where is it being done; What are the major thrust areas; What are the relationships
among major thrust areas; What are the relationships between major thrust areas and
supporting areas, including the performing and archiving infrastructure; What is not
being done; What are the promising directions for new research; What are the
innovations and discoveries?

These  issues  can  be  divided  into  two  categories:  infrastructure  (who,  where,  when)  and 
technical  (what,  why).  To  address  these  issues  comprehensively,  TM  techniques 
typically  have  four  major  generic  components: 

(1)  Information  retrieval  to  select  raw  textual  data  on  which  the  information  processing 
will  be  performed; 

(2)  Bibliometrics  to  identify  the  people,  archival,  institutional,  and  regional 
infrastructure  of  the  topical  domain  being  analyzed; 

(3)  Computational  linguistics  to  extract  topical  themes  of  interest,  and  relationships 
among these themes and the infrastructure components;

(4)  Visualization  and/or  other  types  of  information  display  that  summarize  the  TM 
analyses and results for the users/customers.

While all four TM components are important for a high quality useful product,
good  information  retrieval  is  fundamental  to  the  quality  of  the  results  and  the  latter 
three  components.  All  the  sophisticated  bibliometrics  and  computational  linguistics 
processing  cannot  compensate  for  insufficient  or  unfocused  base  data. 

In  order  to  maintain  awareness  of  global  R&D,  effectively  exploit  its  results,  and 
remain  at  the  cutting  edge  of  R&D,  the  medical  researchers,  clinicians,  and  sponsors 
need  to  understand: 

1)  R&D  done  in  the  past,  to  both  exploit  it  presently  and  not  repeat  mistakes  that  were 
made  in  the  previous  development; 

2)  R&D  being  conducted  presently,  to  both  leverage  existing  programs  for  optimal 
resource  use  and  avoid  duplication; 

3)  R&D  planned  to  be  conducted,  to  allow  a)  strategic  budgetary  planning  for  future 
R&D  transitions;  b)  planning  strategic  cost-sharing  in  areas  of  common  interest;  and  c) 
withdrawal  of  planned  budgets  from  areas  of  peripheral  interest  that  will  be  addressed 
elsewhere. 
Any technology specialty community requires this information both in R&D areas
directly related to its technologies of interest, and in allied and disparate technical
fields as well. These supporting technical areas can serve as sources of innovation and
discovery for advancing the prime technical areas [4], and can help remove the
underlying critical path barriers that serve as roadblocks to progress along the primary
technical paths. Some of the most revolutionary discoveries from TM/information
retrieval have occurred in the medical field, resulting from linking disparate literatures
to the primary target literature [11-16].

Because  of  this  interlocking  nature  of  R&D,  results  from  many  different  types  of  R&D 
efforts  are  required  to  produce  advances  in  any  specific  area.  For  example,  advances 
in  biomedical  instrumentation  require  underlying  advances  in  materials,  electronics, 
signal  processing,  mathematical  analysis,  physics,  chemistry,  energy  conversion, 
radiation  sciences,  solid  and  fluid  mechanics,  robotics  and  micro-technology,  and  other 
technologies  depending  on  specific  applications.  Maximum  advances  in  non-invasive 
medical  diagnostics  require  access  to  the  latest  science  and  engineering  literature  in 
remote  sensing,  non-destructive  evaluation,  signal  and  image  processing,  pattern 
recognition,  multi-source  data  fusion,  fluid  dynamics,  acoustics,  robotics,  materials, 
electronics,  and  many  other  disciplines. 

R&D  sponsors  with  broad  mission  areas  have  an  additional  problem;  their  R&D  needs 
are  very  eclectic.  Results  from  many  different  types  of  R&D  are  required  in  order  to 
accomplish  the  overall  objectives  of  the  sponsoring  organizations.  However,  any 
organization  can  afford  to  sponsor  only  a  small  fraction  of  the  R&D  necessary  to 
provide  the  technical  foundations  for  accomplishing  its  broader  mission  objectives.  It 
is  imperative  for  any  organization  (that  requires  significant  technological  advances  to 
accomplish  its  broad  mission  objectives)  to  maintain  awareness  of  all  the  R&D  being 
performed  globally.  This  continual  awareness  will  allow  the  agency  or  company  to 
leverage  and  exploit  the  results  of  externally-sponsored  R&D  in  a  timely  manner  for 
its  own  and  national  benefit. 

The  technical  community  needs  access  to  a  variety  of  sources  for  this  global  R&D 
information,  to  gain  the  full  spectrum  of  perspectives  on  available  R&D.  These 
sources  include  human  contacts,  literature,  multi-media,  and  physical  sources. 
Advanced  information  retrieval  techniques  that  can  address  the  literature  in  particular 
are  becoming  available.  These  advanced  information  retrieval  methods  could  be  used 
as  the  cornerstone  of  a  process  that  would  both  extract  information  directly  from  the 
text  sources  as  well  as  use  the  preliminary  extracted  information  as  a  gateway  to  the 
other  data  sources.  For  example,  simple  processing  of  the  very  comprehensive 
information  retrieved  by  these  advanced  methods  will  identify  R&D  performers, 
journals,  organizations,  and  sponsors  [5,  6,  8].  These  sources  can  then  be  contacted  to 
provide  a  more  personal  type  of  information  retrieval,  and  supplement  the  literature- 
based  approach  extensively. 
3. Problems with Present Information Retrieval Approaches

The  information  retrieval/literature  surveys  performed  by  the  R&D  community  have 
not  kept  pace  with  the  breadth  and  expansion  of  literature  available.  Present 
information  retrieval  approaches  have  four  major  intrinsic  limitations: 

1)  They  access  only  a  fraction  of  available  source  databases,  due  to  a  combination  of 
lack  of  knowledge  of  the  existing  databases,  lack  of  interest  in  making  the  effort 
required  to  identify  the  complete  scope  of  existing  databases,  and  lack  of  appropriate 
tools  and  techniques  to  readily  access  the  full  spectrum  of  available  data  sources. 

2)  They  are  typically  limited  to  narrowly  focused  literatures,  either  due  to  the 
surveyor’s  lack  of  interest  in  going  beyond  the  directly  focused  target  area,  or  the 
surveyor’s  lack  of  knowledge  about  techniques  and  tools  available  to  readily  access 
allied  and  disparate  literatures  from  which  insights  and  discovery  could  be 
extrapolated. 

3)  They  devote  insufficient  effort  to  query  development,  due  to  lack  of  time  and  other 
resources and/or lack of understanding of the consequences of severely deficient
queries  on  the  quality  of  their  subsequent  R&D. 

4) They are typically based on user-supplied terminology rather than database
author-supplied terminology, due to lack of understanding of the value of using
author-generated terms and/or lack of knowledge of the tools and techniques available
to extract query terms efficiently from the authors’ own writings.


4.  Requirements  and  Mechanics  of  High  Quality  Information 
Retrieval 


4.1.  Requirements 

A  high  quality  query  should  have  the  following  operational  characteristics: 

1)  Retrieve  the  maximum  number  of  records  in  the  technical  discipline  of  interest 

2)  Retrieve  substantial  numbers  of  records  in  closely  allied  disciplines 

3)  Retrieve  substantial  numbers  of  records  in  disparate  disciplines  that  have  some 
connection  to  the  technical  discipline  of  interest 

4)  Retrieve  records  in  aggregate  with  high  signal-to-noise  ratio  (number  of  desirable 
records  large  compared  to  number  of  undesirable  records) 

5)  Retrieve  records  with  high  marginal  utility  (each  additional  query  term  will  retrieve 
large  ratio  of  desirable  to  undesirable  records) 

6)  Minimize  query  size  to  conform  to  limit  requirements  of  search  engine(s)  used 
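Characteristics 4) and 5) can be made concrete with a small calculation. The sketch below assumes that relevance judgments for a sample of records are already available (for instance from a technical expert) and represents records as sets of identifiers; the function names and the sample data are hypothetical.

```python
def signal_to_noise(retrieved, relevant):
    """Ratio of desirable to undesirable records among those retrieved."""
    desirable = len(retrieved & relevant)
    undesirable = len(retrieved - relevant)
    return float("inf") if undesirable == 0 else desirable / undesirable

def marginal_utility(current, with_term, relevant):
    """Counts of desirable and undesirable records contributed only by the added term."""
    new_records = with_term - current
    return len(new_records & relevant), len(new_records - relevant)

# Hypothetical record identifiers.
relevant = {1, 2, 3, 4, 5, 6}          # records judged desirable
current = {1, 2, 3, 10}                # retrieved by the current query
with_term = {1, 2, 3, 4, 5, 10, 11}    # retrieved after adding one more term

ratio_before = signal_to_noise(current, relevant)
gained, noise_added = marginal_utility(current, with_term, relevant)
ratio_after = signal_to_noise(with_term, relevant)
```

In this made-up sample the added term brings in two desirable records and one undesirable record, so coverage improves while the aggregate ratio falls from 3.0 to 2.5. Whether that trade is acceptable is the kind of judgment the requirements above leave to the study performers.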

Development  of  a  high  quality  query  requires: 

1)  Incorporation  of  technical  experts; 
2) In-depth understanding by the study performers of the contents and structure of the
potential  databases  to  be  queried; 

3)  Sufficient  technical  breadth  of  the  study  performers  in  aggregate  to  understand  the 
potentially  different  meanings  and  contexts  that  specific  technical  phrases  could  have 
when  used  in  different  technical  areas  and  by  different  technical  cultures  (e.g.,  SPACE 
SATELLITES,  SATELLITE  CLINICS,  SATELLITE  TUMORS); 

4)  Understanding  of  the  relation  of  these  database  contents  to  the  problem  of  interest; 
and 

5)  Substantial  time  and  effort  on  the  part  of  the  technical  expert(s)  and  supporting 
information  technologist(s). 

Development  of  a  high  quality  query  is  complex  and  time  consuming,  with 
attendant  non-negligible  costs.  The  stringent  and  complex  development 
requirements  run  counter  to  the  unfounded  assertions  being  promulgated  by 
information  technology  algorithm  developers  and  vendors:  sophisticated  tools 
exist  that  will  allow  low-cost  non-experts  to  perform  comprehensive  and  useful 
data  retrieval  and  analysis  with  minimal  expenditures  of  time  and  resources. 

4.2.  Mechanics 

In  order  to  meet  the  requirements  for  a  high  quality  information  retrieval  process 
described  in  the  previous  section,  the  query  development  process  generically  needs  to 
be  full-text  based  and  iterative,  with  relevance  feedback  and  associated  query 
expansion  occurring  during  each  iteration.  A  small  core  group  of  documents  relevant 
to  the  topic  of  interest  is  identified  using  a  test  query.  Unique  characteristics  of  these 
core  documents  are  identified  from  bibliometrics  (authors,  journals,  institutions, 
sponsors,  citations)  and  computational  linguistics  (phrase  frequency  and  phrase 
proximity)  analysis.  Patterns  of  bibliometrics  and  phrase  relationships  in  existing 
fields  are  identified,  the  test  query  is  modified  (by  some  combination  of  human  experts 
and  intelligent  agents)  with  new  search  term  combinations  that  follow  the  newly 
identified  patterns,  and  the  process  is  repeated.  In  addition,  patterns  of  bibliometrics 
and  phrase  relationships  that  reflect  extraneous  non-relevant  material  are  identified, 
and  search  terms  that  have  the  ability  to  remove  non-relevant  documents  from  the 
database  are  added  to  the  modified  query.  This  iterative  procedure  continues  until 
convergence  is  obtained,  where  relatively  few  new  documents  are  found  or  few  non- 
relevant  documents  are  identified,  even  though  new  search  terms  are  added. 

The  specific  steps  used  in  these  generic  relevance  feedback  approaches  are 
summarized  as  follows: 

1)  Definition  of  study  scope; 

2)  Generation  of  query  development  strategy; 

3)  Generation  of  test  query; 

4)  Retrieval  of  records  from  database;  selection  of  sample; 

5)  Division  of  sample  records  into  relevant  and  non-relevant  categories,  or 
gradations  of  relevance; 

6)  Identification  of  bibliometric  and  linguistic  patterns  characteristic  of  each  category. 
In addition to using computational linguistics for characteristic pattern matching in the
semi-structured  databases’  text  fields,  the  author  has  used  bibliometrics  for  pattern 
matching  in  the  following  other  fields  to  retrieve  more  relevant  records: 

6a)  Author  Field;  6b)  Journal  Field;  6c)  Institution  Field;  6d)  Sponsor  Field; 

6e)  Citation  Field; 

There  are  at  least  three  ways  in  which  the  citation  field  can  be  used  to  help  identify 
additional  relevant  papers. 

6ei)  Papers  that  Cite  Relevant  Documents; 

6eii)  Papers  Cited  by  Relevant  Documents; 

6eiii) Other Papers Cited by Papers that Cite Relevant Documents;

7)  Identify  marginal  value  of  adding  bibliometric  and  linguistic  patterns  to  the  query; 

8)  Construct  modified  query; 

9)  Repeat  process  until  convergence  obtained. 
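
The loop formed by steps 3) through 9) can be sketched in Python. This is an illustration only, not the author's procedure: the callables `retrieve`, `judge` and `revise` are hypothetical stand-ins for the database search, the expert relevance judgement and the query modification step, and the stopping rule (fewer than `min_new` previously unseen records) is one assumed reading of "convergence". The toy records are invented.

```python
def develop_query(test_query, retrieve, judge, revise, min_new=1, max_rounds=20):
    """Iterate the query until few new records turn up (assumed convergence test).

    retrieve(query) -> set of record ids      (hypothetical database search)
    judge(record_id) -> bool                  (hypothetical expert relevance call)
    revise(query, relevant, non_relevant) -> new set of terms (hypothetical step 8)
    """
    query = frozenset(test_query)
    seen = set()
    for round_no in range(1, max_rounds + 1):
        records = retrieve(query)
        new_records = records - seen
        seen |= records
        if len(new_records) < min_new:
            return query, round_no
        relevant = {r for r in records if judge(r)}
        query = frozenset(revise(query, relevant, records - relevant))
    return query, max_rounds


# Toy data, invented for illustration only.
DOCS = {
    "d1": {"text", "mining"},
    "d2": {"mining", "clustering"},
    "d3": {"clustering", "retrieval"},
    "d4": {"infrared", "sensor"},
}
RELEVANT = {"d1", "d2", "d3"}


def toy_retrieve(query):
    # A record is retrieved if it shares at least one term with the query.
    return {doc for doc, words in DOCS.items() if words & query}


def toy_judge(doc):
    return doc in RELEVANT


def toy_revise(query, relevant, non_relevant):
    # Add terms seen in relevant records but not in non-relevant ones.
    good = set().union(*(DOCS[d] for d in relevant))
    bad = set().union(*(DOCS[d] for d in non_relevant))
    return set(query) | (good - bad)


final_query, rounds = develop_query({"text"}, toy_retrieve, toy_judge, toy_revise)
```

In real use the `judge` and `revise` steps would involve the expert judgement discussed in the text; the sketch only fixes the control flow around them.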

While  the  generic  query  development  process  is  systematic  as  presented,  it  is  neither 
mechanistic  nor  automated  easily.  Judgement  must  be  used  at  each  detailed  step, 
especially  when  using  the  linguistic  patterns  from  the  text  fields  to  assist  in  the 
generation  of  new  query  terms.  Some  of  the  complexities  in  the  linguistics  pattern 
identification  will  be  summarized. 

Linguistic  patterns  uniquely  characteristic  of  each  category  (relevant  and  non-relevant 
records)  are  selected  to  modify  the  query.  The  underlying  assumption  is  that  records 
in  the  source  database  that  have  the  same  linguistic  patterns  as  the  relevant  records 
from  the  sample  will  have  a  high  probability  of  being  relevant,  and  records  in  the 
source  database  having  the  same  linguistic  patterns  as  the  non-relevant  records  from 
the  sample  will  also  be  non-relevant.  Linguistic  patterns  characteristic  of  the  relevant 
records  modify  the  query  such  that  additional  relevant  records  are  retrieved  from  the 
source  database.  Linguistic  patterns  characteristic  of  the  non-relevant  records  modify 
the  query  such  that  existing  and  additional  non-relevant  records  are  not  retrieved. 

To  expand  the  relevant  records  retrieved,  a  phrase  from  the  sample  records  should  be 
added  to  the  query  if  it: 

1)  appears  predominately  in  the  relevant  record  category; 

2)  has  a  high  marginal  utility  based  on  the  sample; 

3)  has  reasons  for  its  appearance  in  the  relevant  records  that  are  understood  well;  and 

4)  IS  PROJECTED  TO  RETRIEVE  ADDITIONAL  RECORDS  FROM  THE 
SOURCE  DATABASE  (E.G.,  SCI)  MAINLY  RELEVANT  TO  THE  SCOPE  OF 
THE  STUDY. 

If  the  candidate  query  phrase  extracted  from  the  sample  was  part  of  the  test  query,  the 
source  database  occurrence  projection  is  straight-forward.  If  the  candidate  query 
phrase  extracted  from  the  sample  was  not  part  of  the  test  query,  the  actual  source 
database  occurrence  ratio  in  relevant  and  non-relevant  records  may  be  far  different 
from  the  projection  based  on  the  ratio  of  frequency  of  occurrence  in  each  sample 
category. The IR example discussed in the next paragraph is an excellent
demonstration  of  the  mis-estimate  of  total  source  database  occurrence  possible  with 
use  of  a  phrase  derived  from  the  linguistic  patterns  of  the  sample  but  not  part  of  the 
initial  test  query. 

As  an  example  from  the  query  development  in  a  recent  TM  study  on  the  discipline  of 
TM  (3),  the  phrase  IR  (an  abbreviation  for  information  retrieval  used  in  many  SCI 
Abstracts)  was  characteristic  of  predominantly  relevant  sample  records,  had  a  very 
high  absolute  frequency  of  occurrence  in  the  sample,  and  had  a  high  marginal  utility 
based  on  the  sample.  However,  it  was  not  ’projected  to  retrieve  additional  records  from 
the  source  database  mainly  relevant  to  the  scope  of  the  study’.  A  test  query  of  IR  in  the 
Science  Citation  Index  source  database  showed  that  it  occurred  in  65740  records 
dating  back  to  1973.  Examination  of  only  the  first  thirty  of  these  records  showed  that 
IR  is  used  in  science  and  technology  as  an  abbreviation  for  InfraRed  (physics), 
Immuno-Reactivity (biology), Ischemia-Reperfusion (medicine), current (I) x
resistance (R) (electronics), and Isovolume Relaxation (medical imaging). IR occurs as
an abbreviation for information retrieval in probably one percent of the total records
retrieved containing IR, or less. As a result, the phrase IR was not selected as a
stand-alone query modification candidate.
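
The arithmetic behind this decision can be made explicit. The short Python sketch below uses the two figures quoted in the text (65740 records and a relevant share of about one percent, which the text gives as an upper estimate); the function name `projected_yield` is a hypothetical helper introduced here.

```python
def projected_yield(total_hits, relevant_fraction):
    """Split a term's hit count into expected relevant and non-relevant records."""
    relevant = round(total_hits * relevant_fraction)
    return relevant, total_hits - relevant


# Figures from the text: 65740 records contain IR, and at most about one
# percent of them use it to mean information retrieval.
relevant, noise = projected_yield(65740, 0.01)

# Non-relevant records retrieved for every relevant one.
noise_per_relevant = noise / relevant
```

Under these figures the stand-alone term would add roughly 99 non-relevant records for each relevant one.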

Consider  the  implications  of  this  real-world  example.  Assume  a  query  consists  of  200 
terms.  Assume  199  of  these  terms  are  selected  correctly,  according  to  the  guidelines 
above. If the 200th term were like IR above, then the query developer would have been
swamped  with  an  overwhelming  deluge  of  unrelated  records.  ONE  MISTAKE  IN 
QUERY  SELECTION  JUDGEMENT  can  be  fatal  for  a  high  signal-to-noise  product. 
Careful  judgement  must  be  exercised  when  selecting  each  candidate  phrase.  When 
potentially  dominant  relevant  query  modification  terms  extracted  from  the  sample  are 
being  evaluated,  one  has  to  consider  whether  substantial  amounts  of  non-relevant 
records  will  also  be  retrieved  from  use  of  the  query  term  in  the  source  database. 
When  potentially  dominant  non-relevant  query  modification  terms  extracted  from  the 
sample  are  being  evaluated,  one  has  to  consider  whether  substantial  amounts  of 
relevant  records  will  not  be  retrieved. 

Thus,  the  relation  of  the  candidate  query  term  to  the  objectives  of  the  study,  and  to  the 
contents  and  scope  of  the  total  records  in  the  full  source  database  (i.e.,  all  the  records 
in  the  Science  Citation  Index,  not  just  those  retrieved  by  the  test  query),  must  be 
considered  in  query  term  selection.  The  quality  of  this  selection  procedure  will  depend 
upon  the  expert(s)’  understanding  of  both  the  scope  of  the  study  and  the  different 
possible  meanings  of  the  candidate  query  term  across  many  different  areas  of  R&D. 
This  strong  dependence  of  the  query  term  selection  process  on  the  overall  study 
context  and  scope  makes  the  ’automatic’  query  term  selection  processes  reported  in 
the  published  literature  very  suspect. 

5. Improving the Dissemination of High Quality Information
Retrieval  Processes  and  Queries 

This  final  section  proposes  an  option  for  increasing  the  dissemination  of  technical 
discipline  queries  to  relevant  communities.  The  background,  need,  and  proposed 
alternatives  are  outlined.  Finally,  one  specific  additional  application  is  addressed 
briefly. 

5.1.  Background 

The  previous  sections  of  this  paper  have  shown  the  importance  and  complexity  of,  and 
effort  required  to  develop,  high  quality  database  queries.  Yet,  the  dissemination  of 
these  queries  to  other  potential  users  by  their  developers  leaves  much  to  be  desired. 
Other  than  inclusion  in  published  papers,  the  queries  are  essentially  not  distributed. 

A  fundamental  axiom  in  the  R&D  community  is  that  a  comprehensive  literature  survey 
should be performed before R&D is proposed and initiated. Some, if not most, Federal
agencies require that such surveys be performed before R&D is started. The degrees
to which these requirements are enforced and the survey quality and
comprehensiveness are assessed, are unknown. Thus, there may be a lot of 're-inventing
of the wheel', as each research group conducts surveys in topical areas
similar  to  those  surveyed  previously.  In  addition,  if  the  're-invented'  literature  search 
is  not  of  the  same  caliber  as  the  original  (due  to  poorer  queries),  the  prospective 
researcher  will  not  have  the  comprehensive  global  data  to  exploit,  and  the  possibility 
for  duplication  of  effort  increases. 

5.2.  Need 

If  there  were  some  type  of  query  repository,  with  stringent  query  quality  requirements, 
much  of  this  redundancy  could  be  avoided.  Even  if  the  objectives  of  the  prospective 
literature  survey  were  somewhat  different  from  the  objectives  used  to  develop  a 
previous  query,  the  completed  query  could  be  used  as  a  credible  starting  point  for  the 
desired  query.  Many  researchers  will  have  neither  the  time  nor  tools  nor  specialized 
information  technology  capability  to  perform  comprehensive  queries.  Especially  for 
resource  intensive  queries  of  the  type  described  previously,  widespread  availability  of 
these  substantive  queries  could  be  of  high  value  for  a  wide  variety  of  researchers.  The 
question  arises  as  to  how  best  to  make  these  substantive  resource-intensive  queries 
widely  available  to  the  potential  user  community. 

One  feasible  method  would  be  to  establish  a  Web  site  at  one  of  the  existing  data 
repositories  (e.g.,  NTIS,  DTIC).  Queries  would  be  submitted  to  the  site  manager, 
subjected  to  some  review,  then  posted  on  the  site.  Sample  guidance  for  query 
submission  and  content  is  shown  below.  Both  the  broad-based  technical  journals 
(e.g.,  Science,  Nature)  and  the  specialty  technical  journals  (e.g.,  JAMA,  NEJM, 
Journal  of  Aircraft)  would  be  used  to  inform  readers  of  the  new  query  titles  that  have 
been  added  to  the  repository.  This  option  does  not  over-burden  the  expensive  journal 
real  estate,  but  does  inform  interested  readers  of  the  full  query's  location. 

5.3. Guidance for Submitting Queries to Repository

One  component  of  the  overall  repository  maintenance  protocol  would  be  guidance  to 
the  query  developers  for  submitting  their  completed  query  to  the  repository.  The  query 
developers  would  supply  information  describing  the  values  of  each  of  the  parameters 
on  which  the  query  depends.  The  required  information  follows. 

1)  Identify  Contents  of  Specific  Source  Databases  Used; 

2)  Specify  Fields  of  Source  Database  Used  to  Develop  Query; 

3)  Specify  Goals  and  Objectives  of  the  Study  whose  Literature  will  be  Retrieved  by 
the  Query; 

4)  Specify  the  Philosophy  and  Strategy  used  to  Develop  the  Query; 

5)  Specify  the  Technical  Backgrounds  and  Perspectives  of  Query  Developers; 

6)  Describe  the  Features  of  the  Search  Engine  Used,  and  any  Limitations  that  Impacted 
the  Final  Query; 

7)  Describe  any  Other  Events  or  Phenomena  Pertinent  to  the  Final  Query  Form; 

8)  Describe  the  Query  Metrics  (Records  Retrieved,  Relevant  Fraction). 
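
As a sketch of how the eight items above might be captured as a repository record, the following Python dataclass mirrors the list. The class name, field names and the `missing_items` check are assumptions made for illustration; the text does not prescribe any storage format, and the sample values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class QuerySubmission:
    """Hypothetical record for one query submitted to the repository."""
    source_databases: list       # item 1
    fields_used: list            # item 2
    study_objectives: str        # item 3
    development_strategy: str    # item 4
    developer_backgrounds: str   # item 5
    search_engine_notes: str     # item 6
    other_factors: str           # item 7
    records_retrieved: int       # item 8
    relevant_fraction: float     # item 8
    query_text: str = ""

    def missing_items(self):
        """Names of required items left empty, for a reviewer's first check."""
        return [name for name, value in vars(self).items()
                if name != "query_text" and value in ("", [], None)]


# Placeholder values, invented for illustration only.
submission = QuerySubmission(
    source_databases=["Science Citation Index"],
    fields_used=["Title", "Abstract"],
    study_objectives="Survey of the text mining literature",
    development_strategy="",
    developer_backgrounds="Two domain experts",
    search_engine_notes="Boolean operators only",
    other_factors="None noted",
    records_retrieved=1000,
    relevant_fraction=0.9,
)
```

A reviewer could reject or return any submission for which `missing_items()` is not empty before the topical review begins.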


5.4.  Specific  Additional  Application 

One  of  the  functions  of  the  repository  could  be  to  serve  as  an  enforcement  mechanism 
for  credible  literature  surveys  for  Federal  grant  recipients,  if  the  repository  operation  is 
designed  properly.  Each  prospective  researcher  would  perform  the  requisite  literature 
search, and submit the query and associated documentation to the repository
gatekeepers. The query and supportive documentation would be reviewed by topical
domain experts, and any deficiencies due to poor technique, game-playing, or other
reasons, would result in rejection of the query. Not until a credible high quality query
were accepted would the R&D be allowed to proceed. This would ensure that
duplications of effort are minimized, and the latest documented findings of the global
R&D  community  are  available  to  the  prospective  R&D  performer(s)  for  exploitation. 


6.  Summary  and  Conclusions 

Information retrieval plays a central role in modern-day R&D. Present-day
information  retrieval  techniques  in  wide  use  have  limited  capabilities  compared  to 
what  the  state  of  information  technology  can  provide.  Use  of  these  inadequate 
retrieval  techniques  can  result  in  an  excess  of  non-relevant  records,  and  retrieval  of  a 
fraction  of  the  relevant  records  available,  all  of  which  translates  into  waste  of  limited 
R&D  resources.  State-of-the-art  information  retrieval  capabilities  require  time  for 
high  quality  query  development,  and  the  costs  of  this  development  are  not  negligible. 
The  potential  for  net  cost  savings,  due  to  the  elimination  of  duplication  and  use  of 
complete  data  possible  with  use  of  advanced  queries,  is  high.  Once  these  high  quality 
queries  have  been  developed,  they  should  be  disseminated  to  the  broadest  segment  of 
the  technical  community,  and  archived.  One  mechanism  for  accomplishing  this 
dissemination and archiving is through establishment of a query repository, with
attendant  advertising  of  the  repository’s  contents  through  the  technical  journals, 
bulletin  boards,  professional  society  home  pages,  and  other  dissemination  forums. 


Abstract. Image matching and content-based spatial similarity assessment based
on the 2D-String image representation have been extensively studied. However,
for large image databases, matching a query against every 2D-String has
prohibitive cost. Indexing techniques are used to filter out irrelevant images so that
image matching algorithms can focus only on relevant ones. Current 2D-String
indexing techniques are not efficient for handling large image databases. In this
paper, the Two Signature Multi-Level Signature File (2SMLSF) is used as an
efficient tree structure that encodes image information into two types of binary
signatures. The 2SMLSF significantly reduces the storage requirements, responds
to more types of queries, and significantly improves performance over current
techniques. For a simulated image database of 131,072 images, a storage
reduction of up to 35% and a querying performance improvement of up to 93%
were achieved.


1.  Introduction 

Several logical image representation techniques for spatial similarity retrieval have
been previously proposed, such as GR-Strings [1], Spatial Orientation Graphs (SOG)
[2,3] and the symbolic image [4]. Efforts are underway within the MPEG-7 (formally
called the Multimedia Content Description Interface) standard to standardize
multimedia content description [5]. Symbolic projection methods for image
representation based on the 2D-String were introduced in [6]. Among those
techniques, the 2D-String is the most studied. Various extensions of 2D-Strings have
been proposed, such as the 2D-G String [7], the 2D-C String [8] and the 2D-C String [9], to
deal with situations of overlapping objects with complex shapes. The 2D-String
representation  changes  the  problem  of  pictorial  information  retrieval  into  a  problem  of 
2D  sub-sequence  matching.  2D-Strings  allow  for  matching  images  based  on  the 
perception  of  the  objects  and  the  spatial  relations  that  exist  between  them,  thus 
providing  high-level  object-oriented  search  rather  than  search  based  on  the  low-level 
image  primitives  of  objects  such  as  color,  texture,  and  shape. 

To extract the 2D-String of a grayscale image, image understanding and pattern
recognition  techniques  [e.g.  10]  are  used  to  extract  the  pictorial  objects  included  in  the 
image.  In  spite  of  the  amount  of  image  segmentation  research  in  the  past  years,  a 
general  algorithm  for  segmenting  general  images  has  not  yet  been  developed. 
However,  segmentation  has  been  somewhat  successful  on  specific  applications  such  as 
medical  imaging,  e.g.  in  brain  MR  images  [11].  Although  computationally  expensive, 
the  process  of  image  understanding  is  performed  only  once  when  the  image  is  inserted 
into  the  database. 

In a large image database, matching a query image sequentially with every
database image is not feasible. Indexing techniques are used to filter out irrelevant
images to improve the search speed. A successful indexing mechanism should not drop
relevant images (true disposals), and should try to minimize the number of irrelevant
images retrieved (false alarms). The index should be dynamic, capable of answering different
types of queries, and should be efficient in terms of performance and storage space
required. Several techniques for indexing 2D-String databases have been proposed [12-14].
However, those techniques may only be used for answering specific queries and
their performance degrades considerably in large image databases.

Signature files have been widely employed in information retrieval of both
formatted and unformatted data [14-18] and, more recently, in image databases [19,20].
Signatures are commonly calculated using superimposed coding in which each object
(or object pair) in an image is hashed into a word signature. An image signature is
generated by superimposing (ORing) all its individual signatures. To resolve a query,
the query signature is generated and matched (ANDed) against image signatures. Fig. 1
shows an example of generating an image signature from object signatures and the
results of matching different query signatures to the image signature.
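
The superimposed coding scheme of Fig. 1 can be reproduced in a few lines of Python. The bit patterns below are the ones printed in the figure; the helper names `sig`, `superimpose` and `may_contain` are hypothetical and introduced only for this sketch.

```python
def sig(bits):
    """Parse a signature written as groups of binary digits, e.g. '001 000 110 010'."""
    return int(bits.replace(" ", ""), 2)


# Object signatures as printed in Fig. 1.
OBJECTS = {
    "Dog": sig("001 000 110 010"),
    "Deer": sig("010 001 100 010"),
    "Horse": sig("001 000 110 010"),
    "Jockey": sig("001 010 110 000"),
}


def superimpose(signatures):
    """OR all signatures together to form the image (or block) signature."""
    out = 0
    for s in signatures:
        out |= s
    return out


def may_contain(image_sig, query_sig):
    """AND test: every one bit of the query must also be set in the image signature."""
    return (image_sig & query_sig) == query_sig


IMAGE = superimpose(OBJECTS.values())

deer = OBJECTS["Deer"]
desk = sig("000 010 100 101")
deer_and_jockey = superimpose([OBJECTS["Deer"], OBJECTS["Jockey"]])
car = sig("010 010 100 000")
```

The Car query passes the AND test even though no car is in the image, which is the false drop shown in the figure; a signature match therefore only nominates a candidate that must still be verified.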

The  main  contribution  of  this  paper  is  introducing  an  efficient  indexing  technique 
for  large  2D-String  image  databases  based  on  the  two-signature  multi-level  signature 
file (2SMLSF) [19] and providing comparisons of the 2SMLSF to existing indexing
techniques  using  a  large  simulated  image  database.  Comparisons  of  the  2SMLSF  to 
existing  2D-String  indexing  techniques  revealed  that  the  2SMLSF  tree  significantly 
reduces  the  storage  requirements,  responds  to  more  types  of  queries  and  significantly 
improves  the  search  performance.  The  rest  of  this  paper  is  organized  as  follows:  In 
section  2,  an  overview  of  2D-Strings  is  given.  In  section  3,  several  2D-String  indexing 
techniques  are  discussed.  The  2SMLSF  technique  is  introduced  in  section  4. 
Comparisons  of  the  2SMLSF  to  existing  techniques  are  given  in  section  5  followed  by 
conclusions  in  section  6. 


2. 2D-String Overview

The 2D-String of a symbolic picture [6] transforms an image into a two-dimensional
string by projecting the objects of that picture along the x- and y-coordinates. Thus,
the 2D-String is a pair of 1D strings (u, v), where u represents the spatial relationships
between the pictorial objects along the X-axis and v represents those along the Y-axis.
In u and v, "<" denotes the is-west-of and is-south-of relationships, respectively. For
example, consider the symbolic picture f shown in Fig. 2(a). The symbols a, b, c and d
represent pictorial objects. The 2D-String representation of f is S = (d<ab<c, a<bc<d).

A spatial query is also represented by a 2D-String. Thus, the problem of image
retrieval becomes that of 2D subsequence matching. In [6], three types of 2D
subsequences were defined, namely, type-0, type-1 and type-2. A string u is a type-i
1D subsequence of string v if u is contained in v and, whenever $a_1 w_1 b_1$ is a
substring of u, $a_1$ matches $a_2$ in v and $b_1$ matches $b_2$ in v, then:

(type-0) $r(b_2) - r(a_2) \ge r(b_1) - r(a_1)$ or $r(b_1) - r(a_1) = 0$

(type-1) $r(b_2) - r(a_2) \ge r(b_1) - r(a_1) > 0$ or $r(b_2) - r(a_2) = r(b_1) - r(a_1) = 0$

(type-2) $r(b_2) - r(a_2) = r(b_1) - r(a_1)$

where r(x), the rank of a symbol x, is defined to be one plus the number of "<"
symbols preceding x. Let (u, v) and (u', v') be the respective 2D-String representations
of f and f'. Then (u', v') is a type-i 2D subsequence of (u, v) if u' is a type-i 1D
subsequence of u and v' is a type-i 1D subsequence of v. The picture f' is then said to be
a type-i subpicture of f. In this paper, type-2 spatial relationships are used in all the
experiments performed. For example, consider the four images f, f_1, f_2 and f_3 shown in
Fig. 2 [6]. The 2D-String representations of f, f_1, f_2 and f_3 are f: (d < ab < c, a < bc < d),
f_1: (a < c, a < c), f_2: (d < a, a < d) and f_3: (d < ac, a < dc). Then the type-0
subpictures of f are f_1, f_2 and f_3, the type-1 subpictures of f are f_1 and f_2, and the
type-2 subpicture of f is f_1.
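
The rank function and the three subsequence tests can be checked against this example with a short Python sketch. It makes two simplifying assumptions that are not stated in the text: every symbol occurs at most once per string, and the rank conditions are tested for every pair of symbols in which the first precedes the second in the query string. The conditions use "greater than or equal", which is what the worked example requires. The function names are hypothetical.

```python
def ranks(s):
    """Rank of each symbol: one plus the number of '<' symbols preceding it."""
    rank, out = 1, {}
    for ch in s.replace(" ", ""):
        if ch == "<":
            rank += 1
        else:
            out[ch] = rank
    return out


def is_type_i_1d(sub, full, i):
    """Test whether 1D string `sub` is a type-i subsequence of `full` (i in 0, 1, 2)."""
    r1, r2 = ranks(sub), ranks(full)
    if not set(r1) <= set(r2):          # every query symbol must occur in `full`
        return False
    symbols = [ch for ch in sub if ch not in "< "]
    for p, a in enumerate(symbols):
        for b in symbols[p + 1:]:
            d1 = r1[b] - r1[a]          # rank difference in the query string
            d2 = r2[b] - r2[a]          # rank difference in the image string
            if i == 0:
                ok = d2 >= d1 or d1 == 0
            elif i == 1:
                ok = (d2 >= d1 > 0) or (d2 == d1 == 0)
            else:
                ok = d2 == d1
            if not ok:
                return False
    return True


def subpicture_types(sub2d, full2d):
    """List the values of i for which `sub2d` is a type-i 2D subsequence of `full2d`."""
    return [i for i in (0, 1, 2)
            if is_type_i_1d(sub2d[0], full2d[0], i)
            and is_type_i_1d(sub2d[1], full2d[1], i)]


F = ("d<ab<c", "a<bc<d")
F1 = ("a<c", "a<c")
F2 = ("d<a", "a<d")
F3 = ("d<ac", "a<dc")
```

Running `subpicture_types` on the three query pictures reproduces the classification stated in the text: f_1 satisfies all three types, f_2 satisfies type-0 and type-1, and f_3 satisfies type-0 only.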


Fig. 1. Signature generation and comparison based on superimposed coding

(a) Example image with 4 objects: Dog (D), Deer (E), Jockey (J) and Horse (H)

(b) 2D-String representation: {E<JH<D, H<D<E}

(c) Object and image signatures

Object signatures:  (D): 001 000 110 010
                    (E): 010 001 100 010
                    (H): 001 000 110 010
                    (J): 001 010 110 000
Image signature:         011 011 110 010

(d) Sample queries and matching results

Query              Signature          Result
1) Deer            010 001 100 010    Match
2) Desk            000 010 100 101    No Match
3) Deer & Jockey   011 011 110 010    Match
4) Car             010 010 100 000    False Drop


Fig. 2. 2D-String example

3. Previous Work on Indexing 2D-String Image Databases

In [21], the 2D longest common subsequence method for 2D-String matching was
proposed. The problem of string matching is transformed into that of finding a maximal
common subgraph (clique), which has exponential complexity. In addition, each query must be
matched against all images in the 2D-String database.

In [12], a 2D-String was indexed based on all object pairs included in the image.
For each pair of objects o_i and o_j, an ordered triplet (o_i, o_j, r_ij) is created and entered
into a hash table, where r_ij is the spatial relationship between the two objects. Each pair of query
objects acts as a separate query used to retrieve the set of images stored at the
corresponding hash table address. The intersection of the retrieved sets constitutes the
candidate set of images. This addressing scheme requires that all images are known in
advance. A preprocessing step is needed to derive a perfect hash function, which
ceases to be perfect when new images are inserted into the database.

Another approach, based on groups of two or more objects called "image
subsets", was introduced in [13]. All image subsets of size 2 up to a specified size K_max are
produced. The number of image subsets becomes very large, especially when K_max > 5
and n (the number of objects per image) > 10, which renders this method unsuitable for
large image databases. A separate hash table is created for image subsets with the
same number of objects, up to K_max objects. Queries with more than K_max objects
have to be decomposed into multiple smaller queries. In a simulation test on a 1000-image
database, the retrieval response time was slower, in some cases, than that of a
sequential search. In addition, this technique has a significant storage overhead.

The two level signature file (2LSF) [20] uses a two level signature file to represent
the 2D-Strings in the image database. In 2LSF, several 2D-Strings are grouped as a
block. Each 2D-String is associated with a record (leaf) signature and a block (root)
signature. The bit-sliced two-level signature file (BS2LSF) introduced in [22] uses
bit-transposed files to improve the performance of image retrieval of the 2LSF at the
expense of insertion cost. The S-tree [23] is a multilevel signature file that creates
higher level signatures by superimposing signatures at lower levels. As more
signatures are included, the bit density of the signatures will increase, rendering the
method useless due to a large number of false alarms. The multilevel signature file
(MLSF) [14] is a multi-level extension to the 2LSF for text retrieval. The bit density
problem of the S-tree does not exist in the MLSF since signatures at higher levels have
longer lengths and are generated independently from those at lower levels.

4.  The  Two  Signature  Multi-level  Signature  File  Technique 
(2SMLSF) 

The Two Signature Multi-Level Signature File (2SMLSF) (Fig. 3) [19] creates a tree
structure where images or groups of images are represented by binary signatures,
namely Type_S at the leaf level and Type_O at all other levels. The equations used to
calculate w, the signature weight or the number of ones, and m, the signature width at
different levels of the tree, are obtained so that the global false drop probability is
minimized [17,18]. If the one bits are randomly chosen and each of the possible
signatures is equally likely to be chosen, then m and w may be calculated as follows:

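$$w = \frac{1}{\ln 2}\,\ln\frac{1}{F_d} \qquad (1)$$

$$m = \left(\frac{1}{\ln 2}\right)^{2} s\,\ln\frac{1}{F_d} \qquad (2)$$

where $F_d$ is the false drop probability and s is the number of distinct items to be
encoded to create a signature.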

A multilevel signature file is a forest of b-ary trees with every node, except leaf
nodes, in the structure having b child nodes. The number of levels in the structure is h.
The trees are assumed to be complete b-ary trees (n = b^h). Local parameters
representing the value of some global parameter p at level i are denoted p_i. To further
simplify the analysis, it is assumed that the local false drop probability is the same at
every level. The relationship between the global and local false drop probabilities is:

=  “0  ft'-W  ,3) 

n  ;=l 

The 2SMLSF uses two types of signatures for each image. Type_O signatures are used
at all levels except the leaf level and are based only on the objects included in the
image, while Type_S signatures are used only at the leaf level and are based on the
included objects in addition to their spatial relationships. For an image I with x objects,
there exist x(x-1)/2 object pairs. From equation (2), the storage requirements of the two
types of signatures are given by the following equations:

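$$m_o = \left(\frac{1}{\ln 2}\right)^{2} x\,\ln\frac{1}{F_d}, \qquad m_s = \left(\frac{1}{\ln 2}\right)^{2} \frac{x(x-1)}{2}\,\ln\frac{1}{F_d}$$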

The ratio of the storage requirements of the two types of signatures is then calculated as:

$$\frac{m_s}{m_o} = \frac{x-1}{2} \qquad (4)$$

A Type_S signature requires more storage than a Type_O signature whenever x >
3. For large image databases, a substantial reduction in storage and improvement in
query performance will be achieved when Type_O signatures are used. However,
Type_S signatures may answer exact queries about the included objects and their
spatial relationships while Type_O signatures may only answer existential queries
about the objects included in an image. Current signature based methods for 2D-String
indexing [20,22] use only Type_S signatures for indexing. In the 2SMLSF, both types
of signatures are used for image encoding, which allows the 2SMLSF to respond to
both types of queries.
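
The storage comparison can be checked numerically. The Python sketch below assumes the standard superimposed-coding design rule, in which about half of the signature bits are set, so that the weight is log2(1/F_d) and the width is proportional to the number of encoded items s. The function names and the default false drop probability of 0.01 are assumptions for illustration, not values taken from the paper.

```python
import math


def signature_weight(fd):
    """Number of one bits per item for false drop probability fd (assumed design rule)."""
    return math.log(1 / fd) / math.log(2)


def signature_width(s, fd):
    """Signature width in bits for s encoded items (assumed design rule, unrounded)."""
    return s * math.log(1 / fd) / math.log(2) ** 2


def storage_ratio(x, fd=0.01):
    """Ratio m_s / m_o for an image with x objects."""
    m_o = signature_width(x, fd)                  # Type_O: one item per object
    m_s = signature_width(x * (x - 1) // 2, fd)   # Type_S: one item per object pair
    return m_s / m_o
```

Because the width is linear in s, the ratio depends only on x: it equals 1 at x = 3 and 2.5 at x = 6, whatever false drop probability is chosen.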


4.1.  Index  Creation  in  the  2SMLSF 

Fig. 3. The Two Signature Multi-Level Signature File (2SMLSF)

The algorithm used to create the 2SMLSF signature tree (Fig. 4) creates h independent
signatures for each image, one for each level in the tree. At the leaf level, each
pairwise spatial relationship contained in a 2D-String is represented by a spatial string.
For any two objects A and B where A is less than B in alphabetical order, let r(x) be the
rank of object x in a 1D string. The type-2 spatial character, V2(A, B), denoting the
type-2 spatial relationship between A and B is defined as follows [22]:

(type-2) V2(A, B) = "00" if r(A) = r(B)

V2(A, B) = "1" + Str(r(B) - r(A)) if r(A) < r(B)

V2(A, B) = "2" + Str(r(A) - r(B)) if r(A) > r(B)

where "+" denotes string concatenation and Str(x) is a transformation function from
integer x into string "x". The type-2 spatial string, S2(A, B), is the concatenation of the
two symbols A, B and the two type-2 spatial characters V2_u(A, B) and V2_v(A, B), where
V2_u(A, B) and V2_v(A, B) are the type-2 spatial characters of A, B in 1D strings u and v,
respectively. For example, in S = (u, v) = (ad < b < c, ac < b < d), r_u(a) = 1 < r_u(c) = 3
and r_v(a) = 1 = r_v(c) = 1. The type-2 spatial string S2(a, c) representing the type-2
spatial relationships between a and c is: S2(a, c) = "ac1200".
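
A minimal Python sketch of the spatial string construction follows. It assumes single-character object symbols that occur once per string, and in the r(A) > r(B) branch it encodes the positive difference r(A) - r(B), which is an assumption about the intended definition. The function names are hypothetical.

```python
def ranks(s):
    """Rank of each symbol: one plus the number of '<' symbols preceding it."""
    rank, out = 1, {}
    for ch in s.replace(" ", ""):
        if ch == "<":
            rank += 1
        else:
            out[ch] = rank
    return out


def spatial_char(ra, rb):
    """Type-2 spatial character for ranks ra = r(A) and rb = r(B)."""
    if ra == rb:
        return "00"
    if ra < rb:
        return "1" + str(rb - ra)
    return "2" + str(ra - rb)   # assumed: positive difference, direction given by "2"


def spatial_string(a, b, u, v):
    """Type-2 spatial string S2(A, B) for the 2D-String (u, v)."""
    a, b = sorted((a, b))       # A must precede B in alphabetical order
    ru, rv = ranks(u), ranks(v)
    return a + b + spatial_char(ru[a], ru[b]) + spatial_char(rv[a], rv[b])


U, V = "ad<b<c", "ac<b<d"
s2_ac = spatial_string("a", "c", U, V)
s2_bd = spatial_string("b", "d", U, V)
```

For the 2D-String of the example, `s2_ac` is the string "ac1200" given in the text. Note that the encoding is only unambiguous as written while rank differences stay below 10; larger differences would need a fixed-width or delimited form.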

The leaf signature for an object pair has width m_h and weight w_h, obtained from
equations (1) and (2), and is generated using w_h hash functions. The image signature is
created by superimposing (ORing) all object pair signatures in the image. For any other
level i in the tree, an image signature of width m_i and weight w_i is based only on the objects
included in the image. Object signatures are then superimposed to generate the image
signature at level i. All the signatures of images in the same block are then
superimposed to generate the block signature. In general, 2^i image signatures are
superimposed to generate the block signature at level h-i. A pointer to the signatures at
the next lower level is associated with each non-leaf block signature and a pointer
from the leaf signature to the corresponding 2D-String is created. The 2D-String then
points to the corresponding physical image.
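
The signature generation step can be sketched as follows. The paper does not specify the hash functions, so this Python sketch substitutes salted SHA-256 digests purely as an assumption; the width of 32 bits and weight of 4 are also illustrative values, and `item_signature` and `superimpose` are hypothetical names. The three spatial strings are pair strings of the 2D-String (ad < b < c, ac < b < d).

```python
import hashlib


def item_signature(item, m, w):
    """Set up to w of m bits, one per hash function.

    Two hash functions may select the same bit, so the weight can fall below w.
    """
    sig = 0
    for k in range(w):
        digest = hashlib.sha256(f"{k}:{item}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest, "big") % m)
    return sig


def superimpose(signatures):
    """OR signatures together to form an image or block signature."""
    out = 0
    for s in signatures:
        out |= s
    return out


# Leaf (Type_S) signature from a few object-pair spatial strings.
pair_sigs = [item_signature(s, m=32, w=4) for s in ("ac1200", "ab1111", "bc1121")]
leaf_sig = superimpose(pair_sigs)
```

The same two functions would serve at the upper levels, with object symbols in place of spatial strings and with the width and weight chosen for that level.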

An example image database of 8 images is shown in Fig. 5(a). Each image has
between 4 and 6 objects selected from 10 distinct objects. The corresponding MLSF
has 3 levels (log2 8). The total number of signature bits in the tree (excluding pointers)
is 278 bits. The corresponding 2SMLSF (Fig. 5(b)) also has 3 levels (log2 8). The total
number of signature bits in the tree (excluding pointers) is 192 bits, a saving of about
30% over the MLSF. Note that the leaf level signatures in both techniques are the same
since both techniques use Type_S signatures at the leaf.

4.2.  Query  Processing  in  the  2SMLSF 

Two types of queries may be submitted to the 2SMLSF. The first, Type_S queries, have
the form “Find all images which include the set of objects and satisfy all spatial
relations in the query image”; the second, Type_O queries, have the form “Find all
images which include the set of objects in the query image”. Only Type_S queries may
be submitted to the 2LSF, the MLSF, and most other indexing techniques discussed above.

When a query image q is submitted to the 2SMLSF, a 2D-String representation is
created for q, then h signatures labeled $q_1, q_2, \ldots, q_h$ are generated for q using the
algorithm in Fig. 4. Starting at the root, the query signature $q_1$ is ANDed with all root
signatures. If the result is not exactly $q_1$, it is certain that there are no images
underneath the root that satisfy this query (unsuccessful search). If, for a certain root
signature, the result is exactly $q_1$, there is a chance that some images underneath this
signature  satisfy  the  query.  The  search  then  resumes  with  the  signatures  in  the  next 
level  underneath  qualifying  root  signatures. 
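
The top-down filtering just described can be sketched as follows. The `Node` class and the `search` function are hypothetical names introduced for illustration; signatures are modelled as Python integers used as bit vectors.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    signature: int
    children: list = field(default_factory=list)  # empty at the leaf level
    image_id: object = None                       # set only on leaf nodes

def matches(query_sig: int, stored_sig: int) -> bool:
    """A stored signature qualifies when ANDing it with the query gives the query back."""
    return (query_sig & stored_sig) == query_sig

def search(roots, query_sigs):
    """Descend level by level, keeping only the qualifying signatures.

    query_sigs[0] is compared at the root level, query_sigs[1] one level
    below, and so on; the nodes surviving the last level are the candidates.
    """
    frontier = [n for n in roots if matches(query_sigs[0], n.signature)]
    for level in range(1, len(query_sigs)):
        frontier = [child
                    for node in frontier
                    for child in node.children
                    if matches(query_sigs[level], child.signature)]
    return frontier

# Tiny two-level example with hand-picked signatures.
leaf1 = Node(0b0011, image_id=1)
leaf2 = Node(0b0101, image_id=2)
root = Node(0b0111, children=[leaf1, leaf2])
candidates = search([root], [0b0001, 0b0011])
```

Passing a shorter list of query signatures stops the descent early, which is how a search ending above the leaf level could be modelled in this sketch.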

For a Type_S query, the search is repeated until the leaf level is reached. All leaf
signatures that match the $q_h$ signature are query candidates. For a Type_O query, the
search is repeated until the level right above the leaf level is reached since this is the
last level that uses Type_O signatures. All matching signatures at this level are query
candidates. The false drop probability of the Type_S queries is $p_f$ while that for the
Type_O queries is slightly higher since the search stops above the leaf level.

$$p_f = (p_l)^h < (p_l)^{h-1} \qquad (5)$$

since $p_l < 1$ and $h > 1$, then $(p_l)^{h-1} > p_f$. The difference between the two probabilities
decreases  by  increasing  the  number  of  levels  in  the  tree.  Due  to  the  information  loss  in 
representing images by signatures, a false drop may occur. The 2D-Strings pointed to
by the query candidates are passed to a spatial similarity algorithm [e.g., 3, 12], which
performs  a  detailed  checking  in  order  to  exclude  false  drops  and  rank  the  other 
candidates  based  on  their  degree  of  similarity  to  the  query  image. 
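
A small numeric sketch may help to see why the gap between the two false drop probabilities shrinks as the tree gets deeper. It assumes, as a simplification not stated in this section, that every level has the same local false drop probability and that the global probability is their product over h levels; the function name is hypothetical.

```python
def type_o_false_drop(p_f: float, h: int) -> float:
    """False drop probability when the search stops one level above the leaves.

    Assumption: the global probability p_f is the product of h equal local
    probabilities, so the local value is p_f ** (1 / h).
    """
    p_local = p_f ** (1.0 / h)
    return p_local ** (h - 1)

# Gap between the Type_O and Type_S false drop probabilities for p_f = 0.1.
for h in (3, 13, 17):
    gap = type_o_false_drop(0.1, h) - 0.1
    print(f"h = {h:2d}: gap = {gap:.4f}")
```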


5.  Evaluation  of  the  2SMLSF 

In [19], analytical comparisons were performed between the 2SMLSF, the 2LSF
[14, 20], and the MLSF [14]. In this paper, simulations of a large image database were
carried out. Two criteria were used for comparison: the amount of storage required
(M) and the total number of bits (B) compared during query processing. Two
parameters were used to quantify the comparisons, the storage reduction ratio (SRR)
and the computation reduction ratio (CRR), calculated as follows:

$$SRR = \frac{M_{2LSF} - M_{2SMLSF}}{M_{2LSF}}, \qquad CRR = \frac{B_{2LSF} - B_{2SMLSF}}{B_{2LSF}}$$



Input     Given the values of:  n   : number of images in the database
                                p_f : global false drop probability
                                x   : average number of objects per image

Procedure

Step 1:  // initialization
    h = log_b n and b = 2.                        // h is the number of levels
    ∀ i, i = 1, 2, ..., h
        // local false drop probability at level i
        // signature weight
        // signature width

Step 2:  // For every image I, create a leaf Type_S signature (SIG_Ih)
    ∀ Image I, I = 1, 2, ..., n
        SIG_Ih = 0
        ∀ Object i ∈ I, i = 1, 2, ..., x
            ∀ Object j ∈ I, j = 2, 3, ..., x
                SP_ijh  = f(ID_i, ID_j, SpatialRelation_ij)
                SIG_ijh = create_signature(w_h, m_h, SP_ijh)   // as in section 4.1
                SIG_Ih  = OR(SIG_Ih, SIG_ijh)
        Add a pointer from the leaf signature to the 2D-String of this image.

Step 3:  // create h-1 signatures at other levels as follows
    ∀ Level L, L = h-1, h-2, ..., 1
        n_b = 2^(h-L)                             // number of images per block
        ∀ Block B, B = 1, 2, ..., n/n_b
            SIG_BL = 0
            ∀ Image I ∈ B, I = 1, 2, ..., n_b
                SIG_IL = 0
                ∀ Object i ∈ I, i = 1, 2, ..., x
                    SP_iL  = f(ID_i)
                    SIG_iL = create_signature(w_L, m_L, SP_iL)   // as in section 4.1
                    SIG_IL = OR(SIG_IL, SIG_iL)
                SIG_BL = OR(SIG_BL, SIG_IL)
        Adjust pointers to next level.


Fig. 4. The 2SMLSF tree creation algorithm

{3 < 4,5,6 < 7; 6 < 3,4 < 5,7}  {2,3 < 1,5,6; 6 < 1,2 < 3,5}  {6,7 < 3,5 < 1,10; 5 < 1,6,7,10 < 3}  {3,9 < 6 < 2,4,10; 4,6,9 < 3 < 2,10}

(a) The example image database and 2D-String representation

(b) The corresponding 2SMLSF representation

Fig. 5. Image database example

In  [14],  it  was  shown  that  the  storage  requirement  for  the  SLSF  (Single  Level 
Signature  File),  2LSF,  and  MLSF  is  the  same,  if  they  all  have  the  same  global  false 
drop  probability  while  the  MLSF  significantly  reduces  the  number  of  bits  compared 
during  query  processing.  It  was  also  shown  that  the  optimum  blocking  factor  is  b=2, 
i.e.  when  each  node  in  the  MLSF  includes  exactly  2  signatures. 

Several simulations were carried out to compare the 2SMLSF with the MLSF [14]. In
the first simulation, the number of objects per image was chosen at random between 5
and 9, images were divided into 3x3 blocks, and the false drop probability
was set to 0.1. The simulation studied the effect of increasing the number of
images on the storage requirement and the query performance. The number of images
was varied from 8K (8,192) to 128K (131,072) (Fig. 6). The average SRR
improvement for this experiment was 35%. For performance comparisons, two types
of queries were considered: random queries, in which query images are created
randomly, and selected queries, in which query images are selected from images in the
image database. Thus, selected queries are guaranteed to be in the database, while
random queries may lead to unsuccessful searches. The CRR was 89.19% on
average for random queries (Fig. 7) and 50.12% on average for selected queries
(Fig. 8).

In  the  second  simulation,  the  number  of  images  in  the  database  was  kept  constant 
at  32K  (32768  images)  while  the  number  of  objects  per  image  was  varied  from  3  to  7 


6.  Conclusions 

The Two Signature Multi-Level Signature File (2SMLSF) was used for indexing large
2D-String image databases. The 2SMLSF is based on signature representation, which
does not allow any true dismissals. However, false alarms may occur with a certain
controlled probability $p_f$. The value of $p_f$ may be reduced at the expense of additional
storage. Two types of signatures are generated for each 2D-String. Type_S signatures
are stored at the leaf level and are based on the included domain objects and their
spatial relationships, while Type_O signatures are used at all the other levels of the
signature tree and are based only on the domain objects included in the image. Simulation
comparisons of the 2SMLSF to the MLSF in terms of storage requirements (SRR) and
search performance (CRR) were performed. Simulations of image databases with a
variable number of images up to 128K images and a variable number of objects per
image have confirmed that the proposed indexing technique significantly improves
both the SRR (up to 35%) and the CRR (up to 93%) of existing techniques. In addition,
the 2SMLSF can answer both general and exact queries while the MLSF can only
answer exact queries. Future extensions of this work include testing the indexing
technique in a real environment such as medical image databases and extending the
2SMLSF to other 2D-String types such as 2D-G, 2D-C, and 2D-C+ strings.

Abstract. We present a framework for theory refinement operators fulfilling
properties that ensure the efficiency and effectiveness of the learning
process. A refinement operator satisfying these requirements is said to be
ideal. Past results have demonstrated the impossibility of defining
ideal operators in search spaces ordered by the logical implication or the
$\theta$-subsumption relationships. By assuming the object identity bias over
a space defined by a clausal language ordered by logical implication, we
obtain OI-implication, a novel ordering relationship, and show that ideal
operators can be defined for the resulting search space.


1  Introduction 

In this paper we continue our work presented in [4, 14, 3] on the definition of a
framework fulfilling properties that are deemed desirable for the incremental
inductive synthesis of logic-based knowledge. Such properties are ensured by the
notion of ideality of the definable refinement operators, which provides efficiency
and effectiveness of the learning process.

Ideal operators have been proven not to exist in the spaces ordered by the
classical notions of implication or $\theta$-subsumption [15]. Our framework relies on
the Object Identity assumption, which, when applied to the standard ordering
relationships, induces changes upon the corresponding search spaces that allow
for the existence of ideal operators.

After introducing the assumption for the search space induced by $\theta$-subsumption,
yielding the $\theta_{OI}$-subsumption relationship [4, 14], we now weaken the
implication ordering, obtaining OI-implication [5, 3]. It has been shown how
to define ideal operators in clausal spaces ordered by $\theta_{OI}$-subsumption [4, 14].
Extending our framework to spaces ordered by OI-implication, we intend to
investigate whether ideal operators can be specified in these search spaces as well.

The remainder of the paper is organized as follows. Section 2 recalls the basic
notions of the representation language and introduces the new ordering relationships
that we propose, while Section 3 deals with the operators for searching in
the resulting spaces. In Section 4, a novel framework, based on Object Identity,
that overcomes negative results on standard search spaces is presented. Lastly,
Section 5 draws some conclusions.

Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI 1932,  pp.  109-118,  2000. 
2 Preliminaries

In our framework, we adopt a representation language $\mathcal{L}$ expressing theories
as logic programs made up of clauses¹. It is based essentially on the following
assumption.

Assumption 2.1 (Object Identity). In a clause, terms denoted with different
symbols must be distinct, i.e., they represent different entities of the domain.

This  notion  constitutes  the  basis  of  the  novel  generality  orderings  proposed 
in  the  paper. 

2.1  Generality  Orderings 

Essentially,  generalization  can  be  cast  as  a  search  problem  [9].  Hence,  a  major 
issue  is  the  algebraic  organization  underlying  the  search  space. 

Definition 2.2. Given a set S, a binary relation $\preceq$ on S is a quasi-ordering
on S iff it is reflexive and transitive; a quasi-ordering $\preceq$ induces an equivalence
relationship, denoted with $\sim$, such that: $\forall C, D \in S: C \sim D$ iff $C \preceq D \wedge D \preceq C$.
Given two clauses C and D, we say that such a relationship holds properly,
denoted with $C \prec D$, when $C \preceq D \wedge D \not\preceq C$.

Implication and $\theta$-subsumption are the standard ordering relationships
investigated in inductive logic programming. We weaken them in order to obtain
more manageable relationships leading to the definition of a form of implication
that complies with the object identity assumption.


$\theta_{OI}$-subsumption. In order to cope with the object identity principle, we have
derived a new ordering relationship from the classic $\theta$-subsumption, which induces
a quasi-ordering upon the (Datalog [2]) clausal spaces [14, 3].

We discuss further properties required of substitutions in order to fulfill
object identity. In fact, a substitution can be regarded as a function mapping
variables to terms. In particular, we are interested here in a specific type of injective
mappings.

Definition 2.3. Given a set of terms T, let $\sigma$ be a substitution. We say $\sigma$ is an
OI-substitution w.r.t. T iff $\forall t_1, t_2 \in T: t_1 \neq t_2 \Rightarrow t_1\sigma \neq t_2\sigma$.

Hence, we introduce a new relationship, based on $\theta$-subsumption, which
complies with Assumption 2.1:

Definition 2.4. Given two clauses C and D, C $\theta$-subsumes D under object
identity (C $\theta_{OI}$-subsumes D) iff $\exists\sigma$ OI-substitution w.r.t. terms(C) such that
$C\sigma \subseteq D$. Then, we say that C is more general than or equivalent to D (resp. D is
more specific than or equivalent to C) under object identity and we write $D \le_{OI} C$.
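
A brute-force check of Definition 2.4 for function-free clauses can be sketched as follows. The clause encoding is an assumption made for the example: a clause is a set of `(sign, predicate, args)` triples, and variables are strings starting with an upper-case letter. The search is exponential and is meant only to illustrate the injectivity requirement.

```python
from itertools import permutations

def is_var(term) -> bool:
    # Assumption: variables are capitalised strings, constants are lower-case.
    return isinstance(term, str) and term[:1].isupper()

def terms(clause):
    return {t for (_, _, args) in clause for t in args}

def apply(clause, sub):
    return {(sign, pred, tuple(sub.get(t, t) for t in args))
            for (sign, pred, args) in clause}

def theta_oi_subsumes(c, d) -> bool:
    """True iff some OI-substitution sigma w.r.t. terms(c) gives c.sigma ⊆ d."""
    c_terms = terms(c)
    c_vars = sorted(t for t in c_terms if is_var(t))
    c_consts = {t for t in c_terms if not is_var(t)}
    # Constants of c map to themselves, so variable images must avoid them
    # and must be pairwise distinct (hence permutations, not products).
    targets = sorted(terms(d) - c_consts)
    for image in permutations(targets, len(c_vars)):
        if apply(c, dict(zip(c_vars, image))) <= set(d):
            return True
    return False

c = {(True, "p", ("X", "Y"))}
d_same = {(True, "p", ("a", "a"))}
d_diff = {(True, "p", ("a", "b")), (False, "q", ("a",))}
print(theta_oi_subsumes(c, d_same), theta_oi_subsumes(c, d_diff))
```

Under ordinary $\theta$-subsumption p(X, Y) would subsume p(a, a) by mapping both variables to a; the injectivity requirement rules that mapping out here.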


¹ Basic notions about clausal representation can be found in [8, 11].

$\theta_{OI}$-subsumption is a strictly weaker relationship than standard implication
and $\theta$-subsumption [14].

Since $\theta_{OI}$-subsumption maps each literal of the subsuming clause onto a single
literal in the subsumed one, equivalent clauses under $\le_{OI}$ must have the same
number of literals. Thus, a search space ordered by $\theta_{OI}$-subsumption is made up
of non-redundant clauses². As a consequence, it is possible to prove the following
result:

Proposition 2.1. Let C and D be two clauses. Then $C \sim_{OI} D$ iff they are
alphabetic variants.

Implication and $\theta$-subsumption. A characterization of implication with
respect to $\theta$-subsumption was given by Bain and Muggleton [1]. This result bridges
the gap between these two relationships. Indeed, it states that logical implication
between clauses can be divided into two separate steps: a derivation by resolution
[8] and then a subsumption step.

We recall a special case of the subsumption theorem, recently re-proven
with respect to various resolution mechanisms (general, linear, SLD) [11]. In
our framework, we deal with linear resolution, hence the following definition is
needed:

Definition 2.5. Let T be a set of clauses. Then, the n-th linear resolution of
T, denoted by $\mathcal{L}^n(T)$, is defined inductively as follows:

• $\mathcal{L}^1(T) = T$

• $\mathcal{L}^n(T) = \{R \mid C \in \mathcal{L}^{n-1}(T),\ D \in T \cup \mathcal{L}^k(T),\ k < n, \text{ and } R \text{ is a resolvent of } C \text{ and } D\}$ $(n > 1)$

Now we can state the corresponding subsumption theorem as follows:

Theorem 2.1 (Subsumption Theorem). Let C, D be clauses (D non-tautological).
Then $C \Rightarrow D$ iff $\exists E \in \mathcal{L}^n(\{C\})$ $(n > 0)$ such that E $\theta$-subsumes D.

C can be resolved with itself only, or with one of its resolvents. Self resolution
is possible when C is a recursive³ clause. Otherwise, implication for non recursive
(or even non ambivalent) clauses is equivalent to $\theta$-subsumption [6].

2.2 OI-implication

Derived forms of implication are studied here in order to comply with the object
identity assumption. Given the notion of $\theta_{OI}$-subsumption, and using Theorem
2.1, we can define a novel generalization ordering. The goal here is to define
constructively implication under object identity.

First, we have to define a form of resolution coping with the object identity
assumption. From the notion of OI-substitution, we can specify a notion of a
unifier fulfilling Assumption 2.1:

² A clause is called redundant when it is equivalent, w.r.t. a given ordering, to one of
its subsets.

³ A clause is recursive iff there exist literals A, ¬B such that A is unifiable with a
variant of B.

Definition 2.6. Given a finite set of simple expressions S, we say that $\theta$ is
an OI-unifier iff $\exists E\ \forall E_i \in S: E_i\theta = E$ and $\theta$ is an OI-substitution w.r.t.
$terms(E_i)$. An OI-unifier $\theta$ for S is called a most general OI-unifier ($mgu_{OI}$)
for S iff, for each OI-unifier $\sigma$ of S, there exists an OI-substitution $\tau$ such that
$\sigma = \theta\tau$.

Unlike in [3], we derive our definition of OI-resolution from [13]:

Definition 2.7. Given the clauses C and D, standardized apart, the clause R
is an OI-resolvent of C and D iff, given two subsets $M \subseteq C$ and $N \subseteq D$ such
that $\{M, \overline{N}\}$⁴ is unifiable via the $mgu_{OI}$ $\theta$, it holds that:

$R = ((C \setminus M) \cup (D \setminus N))\theta$.

As mentioned before, we will consider only the case of linear resolution [11].
We will denote with $\mathcal{L}_{OI}$ the linear OI-resolution operator, and with $\mathcal{L}^*_{OI}$ its
closure. If C can be derived by means of zero or more (linear) resolution steps
from the set of clauses T, this will be denoted with $T \vdash_{OI} C$ or $C \in \mathcal{L}^n_{OI}(T)$, $n > 0$.
We can now define the form of implication that copes with object identity.

Definition 2.8. Let C and D be any two clauses. C implies D under object
identity (equivalently, C OI-implies D), denoted $C \Rightarrow_{OI} D$, iff either D is a
tautological clause or there exists a clause $E \in \mathcal{L}^*_{OI}(\{C\})$ such that E $\theta_{OI}$-subsumes
D. In this case we say that C is more general than or equivalent to D (resp. D is
more specific than or equivalent to C) under OI-implication. Equivalence under
OI-implication is denoted by $\Leftrightarrow_{OI}$.

It is easy to see that OI-implication is a strictly stronger ordering relationship
than $\theta_{OI}$-subsumption.

3  Refinement  Operators 

Our  learning  problem  is  cast  as  a  search  problem.  In  this  section  we  focus  on 
the  properties  of  the  operators  that  perform  this  search. 

Theory refinement is triggered by new evidence made available to be assimilated
in a knowledge base. Generally speaking, the canonical inductive paradigm
requires the fulfillment of the properties of completeness and consistency for the
synthesized theory with respect to a set of input examples. When an inconsistent
(respectively, incomplete) hypothesis is detected, a specialization (respectively,
generalization) of the hypothesis is required in order to restore this property
of the theory. Roughly speaking, in the former case weaker clauses must be
searched; in the latter, stronger clauses are needed or new ones are to be
introduced. Formally, in terms of the adopted ordering:

Definition 3.1. Given a quasi-ordered set of clauses $(\mathcal{L}, \preceq)$, a refinement
operator is a mapping from $\mathcal{L}$ to $2^{\mathcal{L}}$ such that:

• $\forall C \in \mathcal{L}: \rho(C) \subseteq \{D \in \mathcal{L} \mid D \preceq C\}$ (downward refinement operator)

• $\forall C \in \mathcal{L}: \delta(C) \subseteq \{D \in \mathcal{L} \mid C \preceq D\}$ (upward refinement operator)


⁴ We indicate with $\overline{L}$ the complement of a (set of) literal(s).

A notion of closure upon refinement operators will be useful when proving
the completeness property for the operators.

Definition 3.2. Given a quasi-ordered set $(\mathcal{L}, \preceq)$, let $\tau$ be a refinement operator
and $C \in \mathcal{L}$. The closure of $\tau$ (in symbols $\tau^*$) for C is such that:
$\tau^*(C) = \bigcup_{n \ge 0} \tau^n(C) = \tau^0(C) \cup \tau^1(C) \cup \ldots \cup \tau^n(C) \cup \ldots$
where $\tau^n(C)$ is inductively defined as:

• $\tau^0(C) = \{C\}$

• $\tau^n(C) = \{D \mid \exists E \in \tau^{n-1}(C): D \in \tau(E)\}$

Ultimately, refinement operators should construct chains of refinements from
the initial hypotheses to target ones. The next definition introduces this notion.

Definition 3.3. In a quasi-ordered set $(\mathcal{L}, \preceq)$, given a refinement operator $\tau$, a
sequence of clauses $C_0, C_1, \ldots, C_n$ in $\mathcal{L}$ is a $\tau$-chain iff $C_i \in \tau(C_{i-1})$, $1 \le i \le n$.
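
The closure levels of Definition 3.2 and the chain test of Definition 3.3 can be sketched generically in Python. The operator is passed in as a function from a clause to a set of clauses; the toy operator `drop_literal` and the propositional clause encoding (frozensets of literal names) are assumptions made only for this example.

```python
def refinement_levels(operator, clause, depth):
    """Return [tau^0(C), tau^1(C), ..., tau^depth(C)] as a list of sets."""
    levels = [{clause}]
    for _ in range(depth):
        # tau^n(C) collects every refinement of every clause in tau^(n-1)(C).
        levels.append({d for e in levels[-1] for d in operator(e)})
    return levels

def is_chain(operator, clauses) -> bool:
    """True iff each clause is a one-step refinement of its predecessor."""
    return all(clauses[i] in operator(clauses[i - 1])
               for i in range(1, len(clauses)))

def drop_literal(clause):
    """Toy upward operator: generalise a clause by removing one literal."""
    return {clause - {lit} for lit in clause}

start = frozenset({"p", "q"})
levels = refinement_levels(drop_literal, start, 2)
```

Only a finite prefix of the closure is computed; the full closure is the union over all n and need not be finite for operators on real clausal languages.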


Properties of the Refinement Operators. We specify the properties that
confer ideality to a refinement operator by recalling the definitions in [15]. First,
we define the property that is fundamental to construct refinement operators
that are actually mechanizable.

A major source of inefficiency in computing refinements may come from
clauses that turn out to be equivalent to the starting ones. Indeed, it is
desirable that the chain of refinements leads directly to target elements. Depending
on the search algorithm adopted, refinements that are equivalent to some
element already discarded introduce a lot of useless computation. As to the
effectiveness of the search, a refinement operator should be able to build chains
between any two comparable elements of the search space (or their equivalent
representatives). This means that a complete refinement operator can derive any
comparable element in a finite number of steps.

The  following  definitions  formally  specify  these  concepts: 

Definition 3.4. In a quasi-ordered set $(\mathcal{L}, \preceq)$, a refinement operator $\tau$ is locally
finite iff $\forall C \in \mathcal{L}: \tau(C)$ is finite and computable.

A downward (resp. upward) refinement operator $\rho$ ($\delta$) is proper iff $\forall C \in \mathcal{L}:$
$D \in \rho(C)$ implies $D \prec C$ (resp. $\forall C \in \mathcal{L}: D \in \delta(C)$ implies $C \prec D$).

A downward (resp. upward) refinement operator $\rho$ ($\delta$) is complete iff $\forall C, D \in \mathcal{L}$,
$D \prec C$ implies $\exists E \in \mathcal{L}: E \in \rho^*(C)$ and $E \sim D$ (resp. $C \prec D$ implies
$\exists E \in \mathcal{L}: E \in \delta^*(C)$ and $E \sim D$).

The  combination  of  these  three  properties  confers  effectiveness  and  efficiency 
to  an  operator.  Indeed,  local  finiteness  and  completeness  ensure  the  presence  of  a 
computable  refinement  chain  to  a  target  element.  Besides,  properness  makes  the 
refinement  process  more  efficient,  by  avoiding  the  search  of  equivalent  clauses. 
The  following  definition  accounts  for  all  of  them. 

Definition 3.5. In a quasi-ordered set $(\mathcal{L}, \preceq)$, a downward (resp. upward)
refinement operator $\rho$ ($\delta$) is ideal iff it is locally finite, proper and complete.

Nonexistence conditions for ideal refinement operators in unrestricted sets of
clauses ordered by $\theta$-subsumption are given in [15]:

Theorem 3.1. In an unrestricted search space $(\mathcal{L}, \le_\theta)$, with at least one
predicate symbol of arity > 1, an ideal upward refinement operator does not exist.

Similar results apply for downward refinement in the same search space or
also when a stronger ordering relationship like implication is adopted.


4 Ideal Operators for OI-implication

In this section we propose refinement operators for spaces ordered by
$\theta_{OI}$-subsumption and OI-implication for an unrestricted search space. We also present
the results on their ideality; for brevity the proofs are omitted, but they
can be found in [5].

We show here how it is possible to define ideal refinement operators in
function-free clausal spaces under the weaker, but more mechanizable, ordering
induced by $\theta_{OI}$-subsumption. Ideal refinement operators in this space have
been defined in [14, 3].

Given the notion of OI-substitution, we extend the definition of the relationship
$\le_{OI}$ (and then the refinement operators) to the case of unrestricted search
spaces ordered by $\theta_{OI}$-subsumption. Indeed, an easy characterization that can be
made is that D $\theta_{OI}$-subsumes C whenever an OI-substitution $\sigma$ exists, such that
$D\sigma \subseteq C$. We extend the definition of the refinement operators to include also
the case of functions.

Definition 4.1. Let C be a clause. Then $D \in \rho_{OI}(C)$ when one of these
conditions holds:

1. $D = C\theta$, where $\theta = \{X/a\}$, $X \in vars(C)$, $a \notin consts(C)$;

2. $D = C\theta$, where $\theta = \{X/f(Y_1, \ldots, Y_n)\}$, f is an n-ary function symbol
($n > 0$) and $X \in vars(C)$;

3. $D = C \cup \{L\}$, where L is a literal, such that: $L \notin C$.

$D \in \delta_{OI}(C)$ when one of these conditions holds:

1. $D = C\sigma$, where $\sigma = \{a/X\}$, $a \in consts(C)$, $X \notin vars(C)$;

2. $D = C\sigma$, where $\sigma = \{f(Y_1, \ldots, Y_n)/X\}$, f is an n-ary function symbol
($n > 0$) such that $f(Y_1, \ldots, Y_n) \in terms(C)$ and $X \notin vars(C)$;

3. $D = C \setminus \{L\}$, where L is a literal, such that: $L \in C$.
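
As an illustration, the upward operator can be sketched for the function-free case, where only the first and third conditions apply. The encoding is an assumption made for the example (clauses as frozensets of `(sign, predicate, args)` triples, variables as capitalised strings), and the fresh-variable naming scheme is hypothetical.

```python
def is_var(term) -> bool:
    # Assumption: variables are capitalised strings, constants are lower-case.
    return isinstance(term, str) and term[:1].isupper()

def delta_oi(clause, prefix="V"):
    """One-step upward refinements of a function-free clause.

    Covers condition 1 (replace a constant by a variable not occurring in
    the clause) and condition 3 (drop a literal); condition 2 concerns
    function terms and is outside this sketch.
    """
    all_terms = {t for (_, _, args) in clause for t in args}
    k = 0
    while f"{prefix}{k}" in all_terms:      # pick a variable not in vars(C)
        k += 1
    fresh = f"{prefix}{k}"

    refinements = set()
    for const in (t for t in all_terms if not is_var(t)):
        refinements.add(frozenset(
            (sign, pred, tuple(fresh if t == const else t for t in args))
            for (sign, pred, args) in clause))
    for literal in clause:
        refinements.add(clause - {literal})
    return refinements

clause = frozenset({(True, "p", ("a", "X")), (False, "q", ("a",))})
generalisations = delta_oi(clause)
```

Every occurrence of the chosen constant is replaced by the same fresh variable, so the inverse substitution maps distinct terms to distinct terms, in the spirit of the object identity assumption.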

Even  in  this  case,  we  obtain  ideal  refinement  operators  for  this  search  space. 

Theorem 4.1. In an unrestricted clausal space, the operators $\rho_{OI}$ and $\delta_{OI}$ are
ideal refinement operators.

When dealing with non recursive clauses, generalization and specialization
under implication (respectively OI-implication) correspond to the cases
considered for $\theta$-subsumption ($\theta_{OI}$-subsumption), because of Gottlob's theorem [6].
Now, we want an operator for computing the resolution and inverse resolution
steps, in the case of recursive clauses (for the subsumption theorem). [10]
introduced the notions of powers and roots as operations where a clause is resolved
with itself. They can be considered as refinement operators.

Definition 4.2. Let C be a clause. A clause D is an n-th power of C iff D is a
variant of a clause in $\mathcal{L}^n(\{C\})$ $(n > 1)$. We also say that C is an n-th root of D.

By exploiting the subsumption theorem, a way to obtain downward refinements
of a clause in a search space ordered by implication is to self-resolve a
clause n times to obtain an n-th power or to apply the downward refinement
operator used for $\theta_{OI}$-subsumption. Conversely, upward refinements require
computing an n-th root of a clause and, again, employing an upward refinement
operator for $\theta_{OI}$-subsumption. Neither case is practically feasible, since we
do not know a priori the n to stop at in the process of self-resolution. Moreover,
while it is clear how to compute n-th powers by using linear resolution, in order
to find downward refinements of a clause, the dual is a more complex task since
it involves inversion steps.


Inverting  OI-Resolution.  We  deal  with  the  problem  of  inverting  resolution 
by  adapting  the  technique  presented  in  [7]  to  our  framework.  Specifically,  we 
start  by  defining  a  way  to  construct  parent  clauses  given  the  resolvent,  then  we 
generalize  this  process  with  the  aim  of  constructing  OI-ancestors  of  the  starting 
clause. 

Given a clause R, it can be considered as the resolvent of two clauses C and
D according to the definition of resolution, such that:

$R = ((C \setminus M) \cup (D \setminus N))\theta$,

where C and D are standardized apart and $\theta$ is an $mgu_{OI}$ for $\{M, \overline{N}\}$.

Besides, the most specific parent clauses resolve upon just one literal (M
and N are singletons $L_C$ and $L_D$, respectively) and inherit all literals from
the OI-resolvent, hence $\theta$ has to be an $mgu_{OI}$ also for $\{C \setminus \{L_C\}, D \setminus \{L_D\}\}$⁵:
$(C \setminus \{L_C\})\theta = (D \setminus \{L_D\})\theta = R$

Thus:

$C = R \cup \{L\}$ and $D = R \cup \{\overline{L}\}$,
where $L = L_C = \overline{L_D}$.

Hence, by introducing a new literal, we have obtained two parent clauses.
This applies to the cases of ambivalent clauses. It holds:

Proposition 4.1. Let R be a clause and L be a literal. Then: $\{R\} \Leftrightarrow_{OI} \{R \cup \{L\}, R \cup \{\overline{L}\}\}$.

It is also provable [5] that:

Proposition 4.2. Let C and D be two clauses and R an OI-resolvent of C and
D. Then there exists a literal L such that $C \ge_{OI} R \cup \{L\}$ and $D \ge_{OI} R \cup \{\overline{L}\}$.


⁵ The extension of unifiers from simple expressions to formulae is straightforward.

By using the same technique iteratively, applied this time to invert more
than just one resolution step, we compute clauses from which R follows in two
steps and so on, by introducing other literals. Namely, given $\{R \cup \{L\}, R \cup \{\overline{L}\}\}$,
we can decide to invert either of the two parent clauses. Then we can apply the
or-introduction technique to the clause chosen (say, the first), obtaining:

$\{R \cup \{L\} \cup \{L'\},\ R \cup \{L\} \cup \{\overline{L'}\},\ R \cup \{\overline{L}\}\}$.

This technique can be extended by defining the following notion:

Definition 4.3. Let C be a clause and $\Omega$ be a sequence of literals. Then, a set
of clauses S is or-introduced from C by $\Omega$ iff:

1. $S = \{C\}$ and $\Omega = [\,]$, or

2. $S = (S' \setminus \{D\}) \cup \{D \cup \{L\}, D \cup \{\overline{L}\}\}$ and $\Omega = [L_1, \ldots, L_n, L]$, where $S'$ is a
set of clauses or-introduced from C by $[L_1, \ldots, L_n]$ and $D \in S'$.
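
One or-introduction step can be sketched in Python as follows. The propositional encoding (a literal is a `(sign, atom)` pair, a clause is a frozenset of literals) and the function names are assumptions made for the example; the two-step run mirrors the example given before the definition.

```python
def negate(literal):
    """Complement of a literal encoded as (sign, atom)."""
    sign, atom = literal
    return (not sign, atom)

def or_introduce(clauses, chosen, literal):
    """Replace `chosen` by the two clauses obtained by adding `literal`
    and its complement, leaving the other clauses untouched."""
    if chosen not in clauses:
        raise ValueError("the chosen clause must belong to the set")
    return (clauses - {chosen}) | {chosen | {literal},
                                   chosen | {negate(literal)}}

R = frozenset({(True, "r")})
L1 = (True, "l")
L2 = (True, "m")

step0 = frozenset({R})                   # empty sequence of literals
step1 = or_introduce(step0, R, L1)       # {R ∪ {L}, R ∪ {¬L}}
step2 = or_introduce(step1, R | {L1}, L2)
```

Each step grows the set by exactly one clause, so a sequence of n literals yields n + 1 clauses.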

Logical  equivalence  holds  after  this  step  of  inversion  [5] : 

Theorem  4.2.  Let  S  be  a  set  of  clauses  or-introduced  from  clause  C.  Then 
S  =o,  {<?}. 

A  sequence  of  resolutions  can  be  inverted  by  applying  or-introduction  of  a 
sequence  of  literals  [5] : 

Theorem  4.3.  Let  T  be  a  set  of  clauses,  D  a  clause  in  C^,(T).  Then  there 
exists  a  set  of  clauses  S  or-introduced  from  D,  3 C  £  T  such  that  \/E  £  S  :  C 
6or subsumes  E. 

OI- Expansions.  We  have  seen  how  starting  from  a  clause  it  is  possible  to  obtain 
a  set  of  generalizations  that  is  logically  equivalent  to  an  n-th  power  of  the  clause 
while  this  0Oi-subsumes  the  clauses  in  the  set.  The  goal  is  to  reduce  resolution 
to  subsumption  mechanisms.  Thus,  we  come  to  the  actual  computation  of  the 
upward  refinements  of  clauses  by  using  the  notion  of  expansions. 

Definition  4.4.  Let  C  be  a  clause  and  ft  a  sequence  of  literals.  Then  a  clause 
E  is  an  OI-expansion  of  C  by  ft  iff  E  is  a  least  general  generalization  under 
0or subsumption  of  a  set  of  clauses  or-introduced  from  C  by  the  sequence  ft. 

The  notion  of  least  general  generalization  under  (fr  sub  sumption  (lgg0i)  [14] 
used  in  this  definition  is  a  transposition  of  Plotkin’s  Igg  [12], 

Idestam-Almquist shows that his technique is practically infeasible, since it
leads to an exponential growth of the computed expansion. Indeed, he proves
that if $n$ is the number of literals or-introduced to compute an expansion $E$ of a
clause $C$ such that $|C| = m$, then the maximal cardinality of $E$ is $(m + n)^{n+1}$.
Instead, in our framework [5]:

Theorem 4.4. Let $C$ be a clause ($|C| = m$), $S$ a set of clauses or-introduced
from $C$ by $[L_1, \ldots, L_n]$, and $E$ an $lgg_{OI}$ of $S$. Then $|E| \le (m + n)$.

Another important property of OI-expansions of a clause is that they are
logically equivalent to it.


Theorem 4.5. Let $C$ be a clause and $E$ its OI-expansion by some sequence $\Omega$.
Then $C \Leftrightarrow_{OI} E$.

The  main  result  is  the  following  [5]. 

Theorem 4.6. Given two clauses $C$ and $D$, $D$ non-tautological, if $C$ OI-implies
$D$ then there exists an expansion $E$ of $D$ such that $C$ $\theta_{OI}$-subsumes $E$.

Hence, we can define refinement operators $\delta'_{OI}$ and $\rho'_{OI}$ for spaces ordered by
OI-implication:

Definition 4.5. Let $C$ be a clause, then:

•  $D \in \delta'_{OI}(C)$ iff $\exists E$, $E$ expansion of $C$, and $D \in \delta_{OI}(E)$;

•  $D \in \rho'_{OI}(C)$ iff $E \in \mathcal{L}^n_{OI}(\{C\})$, for some $n$, and $D \in \rho_{OI}(E)$.

As regards the properties of these operators, we have already remarked that
computing the n-th powers of a clause in the definition of $\rho'_{OI}$ is a merely
theoretical issue, for the algorithm cannot know a priori which $n$ to stop at. This
yields an operator that is not locally finite. It is, instead, surely a proper and
complete operator, owing to the properness and completeness of $\rho_{OI}$. Conversely,

Theorem 4.7. In a space ordered by OI-implication, $\delta'_{OI}$ is an ideal upward
refinement operator.

5  Conclusions  and  Future  Work 

Many problems encountered in ILP are theoretically or practically infeasible.
Therefore, biasing them can help to find solutions in significant, yet restricted
cases. This work is an effort in this direction: in our framework, the language was
not deprived of representation power; however, the complexity of the refinement
operators was reduced because of the bias on the search space.

The work presented regarded the definition of a framework fulfilling the
property of ideality of refinement operators, which guarantees the efficiency and
effectiveness of the learning process. While such operators have been proven not
to exist in the spaces ordered by the notions of implication or $\theta$-subsumption,
in our framework, relying on the Object Identity assumption, we have weakened
the implication ordering, obtaining OI-implication, which allows for the existence
of ideal operators in the corresponding search spaces.

Future work will concern a deeper investigation of the properties of
OI-implication. OI-implication seems to be promising since it appears more
mechanizable than implication, yet the relationships holding between this ordering and
the others presented in this work deserve further study. For the moment, we have
stated that OI-implication is strictly weaker than unconstrained implication and
stronger than $\theta_{OI}$-subsumption. In addition, we have given an ideal upward
refinement operator for search spaces ordered by OI-implication. A model-theoretic
definition of this notion ought to be given together with the proof of its
decidability. Hence, it should be easy to define ideal downward operators.

Abstract. Rule quality measures can help to determine when to stop
generalization or specification of rules in a rule induction system. Rule
quality measures can also help to resolve conflicts among rules in a rule
classification system. We enlarge our previous set of statistical and empirical
rule quality formulas, which we tested earlier on a number of standard
machine learning data sets. We describe this new set of formulas and compare
them in extensive tests that go beyond our earlier tests. We also specify how
to generate formula-behavior rules from our experimental results, which show
the relationships between a formula’s performance and the characteristics of
a dataset. Formula-behavior rules can be combined into formula-selection
rules, which can select a rule quality formula before rule induction takes
place. We report the experimental results showing the effects of formula
selection on the predictive performance of a rule induction system.


1  Introduction 

A  rule  induction  system  generates  decision  rules  from  a  set  of  training  data.  The 
set  of  decision  rules  determines  the  performance  of  a  classifier  that  exploits  the 
rules  to  classify  unseen  objects.  It  is  therefore  important  for  a  rule  induction 
system  to  generate  decision  rules  that  have  high  predictability  or  reliability. 
These  properties  are  commonly  measured  by  a  function  called  rule  quality.  A  rule 
quality  measure  is  needed  in  both  the  rule  induction  and  classification  processes. 
In rule induction, a rule quality measure can be used as a criterion in the rule
specification and/or generalization process. In classification, a rule quality value
can  be  associated  with  each  rule  to  resolve  conflicts  when  multiple  rules  are 
satisfied  by  the  example  to  be  classified. 

We survey a number of statistical and empirical rule quality measures, some
of which have been discussed by Bruha [7,8] and An and Cercone [3]. In our
earlier work [3], we evaluated some of these formulas on a smaller collection of
data sets. One contribution of this paper is to include more formulas in our
experiments; the tests also go beyond our earlier tests by including more data
sets. In our evaluation, ELEM2 [2] is used as the basic learning and
classification algorithm. We report the experimental results from using these
formulas in ELEM2 and compare the results by indicating the significance level
of the difference between each pair of formulas. In addition, the relationship
between the performance of a formula and a dataset is obtained by automatically
generating formula-behavior rules from a dataset that describes the experimental
results for the formulas and the characteristics of the datasets. The
formula-behavior rules are further combined into formula-selection rules, which
can be employed by ELEM2 to select a rule quality formula before inducing rules
from a dataset. We report the experimental results showing the effects of formula
selection on ELEM2’s predictive performance.

2  Rule  Quality  Measures 

Many  rule  quality  measures  are  derived  by  analyzing  the  relationship  between 
a  decision  rule  R  and  a  class  C.  The  relationship  can  be  depicted  by  a  2  x  2 
contingency  table  [4,7]: 

Table 1. Contingency Table with Absolute Frequencies

                      Class C            Not class C              Total
Covered by rule R     $n_{rc}$           $n_{r\bar{c}}$           $n_r$
Not covered by R      $n_{\bar{r}c}$     $n_{\bar{r}\bar{c}}$     $n_{\bar{r}}$
Total                 $n_c$              $n_{\bar{c}}$            $N$

where $n_{rc}$ is the number of training examples covered by rule $R$ and belonging to
class $C$; $n_{r\bar{c}}$ is the number of training examples covered by $R$ but not belonging
to $C$, etc.; $N$ is the total number of training examples; $n_r$, $n_{\bar{r}}$, $n_c$ and $n_{\bar{c}}$ are
marginal totals, e.g., $n_r = n_{rc} + n_{r\bar{c}}$, which is the number of examples covered
by $R$. The contingency table can also be presented using relative rather than
absolute frequencies as follows:

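Table 2. Contingency Table with Relative Frequencies

                      Class C            Not class C              Total
Covered by rule R     $f_{rc}$           $f_{r\bar{c}}$           $f_r$
Not covered by R      $f_{\bar{r}c}$     $f_{\bar{r}\bar{c}}$     $f_{\bar{r}}$
Total                 $f_c$              $f_{\bar{c}}$            1

where $f_{rc} = n_{rc}/N$, $f_{r\bar{c}} = n_{r\bar{c}}/N$, and so on.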

2.1  Empirical  Formulas 

Empirical rule quality formulas are based on intuitive logic. We describe two
empirical formulas that combine two basic characteristics of a rule: consistency
and coverage. Using the elements of the contingency table, the consistency of a
rule $R$ can be defined as $cons(R) = n_{rc}/n_r$ and its coverage as $cover(R) = n_{rc}/n_c$.
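
As a minimal sketch of these two quantities, assuming the four cell counts of the contingency table are available, the following Python fragment computes the marginal totals and then consistency and coverage. The class name `Contingency`, its field names and the example counts are illustrative only and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """2 x 2 contingency table of a rule R against a class C (absolute counts)."""
    n_rc: int    # covered by R, in class C
    n_rnc: int   # covered by R, not in class C
    n_nrc: int   # not covered by R, in class C
    n_nrnc: int  # not covered by R, not in class C

    @property
    def n_r(self) -> int:
        """Number of examples covered by R."""
        return self.n_rc + self.n_rnc

    @property
    def n_c(self) -> int:
        """Number of examples in class C."""
        return self.n_rc + self.n_nrc

    @property
    def total(self) -> int:
        """Total number of training examples N."""
        return self.n_rc + self.n_rnc + self.n_nrc + self.n_nrnc


def consistency(t: Contingency) -> float:
    """Fraction of the examples covered by R that belong to C."""
    return t.n_rc / t.n_r


def coverage(t: Contingency) -> float:
    """Fraction of the examples of C that are covered by R."""
    return t.n_rc / t.n_c


# Made-up counts for illustration: N = 100, rule covers 40, class has 50.
table = Contingency(n_rc=30, n_rnc=10, n_nrc=20, n_nrnc=40)
print(consistency(table), coverage(table))
```

Both functions assume that the rule covers at least one example and that the class is not empty; otherwise the ratios are undefined.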


Weighted Sum of Consistency and Coverage. Michalski [13] proposes to
use the weighted sum of the consistency and coverage as a measure of rule quality
as follows:

    $Q_{WS} = w_1 \times cons(R) + w_2 \times cover(R)$

where $w_1$ and $w_2$ are user-defined weights with values in (0, 1) that sum to 1.
This formula is applied in the incremental learning system YAILS [15]. The
weights in YAILS are specified automatically as $w_1 = 0.5 + \frac{1}{4} cons(R)$ and
$w_2 = 0.5 - \frac{1}{4} cons(R)$. These weights depend on consistency: the larger the
consistency, the more influence consistency has on rule quality.

Product of Consistency and Coverage. Brazdil and Torgo [6] propose to
use a product of consistency and coverage as rule quality:

    $Q_{Prod} = cons(R) \times f(cover(R))$

where $f$ is an increasing function. The authors conducted a large number of
experiments and chose the following form of $f$: $f(x) = e^{x-1}$. With this choice
of $f$, differences in coverage have a smaller influence on rule quality, so the
formula favors consistency.
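
The two empirical formulas can be sketched in a few lines of Python. The sketch assumes the YAILS weights are $w_1 = 0.5 + cons(R)/4$ and $w_2 = 0.5 - cons(R)/4$ and that $f(x) = e^{x-1}$; the function names `q_ws` and `q_prod` and the example values are illustrative.

```python
import math


def q_ws(cons: float, cover: float) -> float:
    """Weighted sum of consistency and coverage with consistency-dependent weights."""
    w1 = 0.5 + 0.25 * cons  # assumed YAILS weight on consistency
    w2 = 0.5 - 0.25 * cons  # assumed YAILS weight on coverage; w1 + w2 == 1
    return w1 * cons + w2 * cover


def q_prod(cons: float, cover: float) -> float:
    """Product of consistency and f(coverage), with f(x) = exp(x - 1)."""
    return cons * math.exp(cover - 1.0)


# Illustrative rule with consistency 0.75 and coverage 0.6.
print(q_ws(0.75, 0.6), q_prod(0.75, 0.6))
```

Because $e^{x-1}$ only ranges over roughly 0.37 to 1 for coverage in [0, 1], a change in coverage moves `q_prod` less than the same change in consistency does, which is the preference for consistency described above.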

2.2  Measures  of  Association 

A  measure  of  association  indicates  a  relationship  between  the  classification  for 
the  columns  and  the  classification  for  the  rows  in  the  2x2  contingency  table. 

Pearson $\chi^2$ Statistic. The $\chi^2$ statistic is based on the following assumption:
if the classification for the columns is independent of that for the rows, the
frequencies in the cells of the contingency table should be proportional to the
marginal totals. The $\chi^2$ value is given by

    $\chi^2 = \sum \frac{(n_o - n_e)^2}{n_e}$

where $n_o$ is the observed absolute frequency of examples in a cell, and $n_e$ is
the expected absolute frequency of examples for the cell. For example, for the
upper-left cell, $n_o = n_{rc}$ and $n_e = n_r n_c / N$. The value is computed for
each cell of the table individually and the values for all cells are added to yield
the value of $\chi^2$. This value measures whether the classification of examples by
rule $R$ and that by class $C$ are related. The lower the $\chi^2$ value, the more likely it
is that the correlation between $R$ and $C$ is due to chance.
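
The cell-by-cell computation described above can be sketched as follows. The function name `chi_square`, the argument order and the example counts are illustrative; the sketch assumes every marginal total is non-zero so that each expected frequency is positive.

```python
def chi_square(n_rc: int, n_rnc: int, n_nrc: int, n_nrnc: int) -> float:
    """Pearson chi-square statistic of a 2 x 2 contingency table."""
    n = n_rc + n_rnc + n_nrc + n_nrnc
    observed = ((n_rc, n_rnc), (n_nrc, n_nrnc))
    row_totals = (n_rc + n_rnc, n_nrc + n_nrnc)   # covered / not covered by R
    col_totals = (n_rc + n_nrc, n_rnc + n_nrnc)   # class C / not class C
    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


# Made-up counts: the rule covers 40 of 100 examples, 30 of them in class C.
print(chi_square(30, 10, 20, 40))
```

A table whose cells are exactly proportional to the marginal totals, such as `chi_square(20, 20, 30, 30)`, yields zero.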

$G^2$ Likelihood Ratio Statistic. The $G^2$ likelihood ratio measures the
distance between two distributions: the observed frequency distribution of
examples among classes satisfying the rule $R$ and the expected frequency distribution
of the same number of examples under the assumption that the rule $R$ selects
examples randomly. The value of this statistic can be obtained using the absolute
frequencies in the contingency table as follows:

    $G^2 = 2 \left( n_{rc} \log \frac{n_{rc} N}{n_r n_c} + n_{r\bar{c}} \log \frac{n_{r\bar{c}} N}{n_r n_{\bar{c}}} \right)$

where the logarithm is of base e. The lower the $G^2$ value, the more likely it is
that the apparent association between the two distributions is due to chance.
Both the $\chi^2$ and the likelihood ratio statistics are distributed asymptotically as
$\chi^2$ with one degree of freedom.
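
A hedged sketch of this statistic, assuming it sums over the two classes (C and not C) among the examples covered by the rule, with expected counts $n_r n_c / N$ and $n_r n_{\bar{c}} / N$. The function name `g2` and its argument order are illustrative; a cell with zero observed count is taken to contribute zero.

```python
import math


def g2(n_rc: int, n_rnc: int, n_c: int, n_nc: int) -> float:
    """G^2 likelihood ratio for the examples covered by a rule.

    n_rc, n_rnc: covered examples inside / outside class C.
    n_c, n_nc:   class sizes of C and its complement in the training set.
    """
    n = n_c + n_nc        # total number of training examples N
    n_r = n_rc + n_rnc    # number of examples covered by the rule
    total = 0.0
    for observed, class_size in ((n_rc, n_c), (n_rnc, n_nc)):
        if observed > 0:  # x * log(x) tends to 0 as x tends to 0
            expected = n_r * class_size / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


# Made-up counts: 40 covered examples (30 in C), classes of size 50 and 50.
print(g2(30, 10, 50, 50))
```

When the covered examples follow the prior class distribution exactly, for instance `g2(20, 20, 50, 50)`, both log terms vanish and the statistic is zero.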

2.3 Measures of Agreement

A measure of agreement concerns the association of the elements of a contingency
table on its main diagonal only [7].

Cohen’s Formula. We can measure the actual agreement by simply summing
up the main diagonal using the relative frequencies: $f_{rc} + f_{\bar{r}\bar{c}}$. A chance agreement
occurs if the row variable is independent of the column variable, which can
be measured by $f_r f_c + f_{\bar{r}} f_{\bar{c}}$. Cohen [9] suggests comparing the actual agreement
with the chance agreement by using the normalized difference of the two, which
we can use as a rule quality measure:

    $Q_{Cohen} = \frac{f_{rc} + f_{\bar{r}\bar{c}} - (f_r f_c + f_{\bar{r}} f_{\bar{c}})}{1 - (f_r f_c + f_{\bar{r}} f_{\bar{c}})}$

When both elements $f_{rc}$ and $f_{\bar{r}\bar{c}}$ are reasonably large, Cohen’s statistic gives a
higher value, which indicates agreement on the main diagonal.


Coleman’s Formula. Coleman [5,7] defines a measure of agreement that
indicates an association between the first column and any particular row in the
contingency table. Bruha [7] suggests using a modified version of Coleman’s
measure for the purpose of rule quality definition, which actually responds to the
agreement on the upper-left element of the contingency table. The formula is also
derived by normalizing the difference between the actual and chance agreement
as follows:

    $Q_{Coleman} = \frac{f_{rc} - f_r f_c}{f_r - f_r f_c}$

C1 and C2 Formulas. Further analysis indicates that Coleman’s formula
does not properly take into account the coverage (i.e., completeness) of a rule.
On the other hand, Cohen’s statistic is more completeness-based. Therefore,
Bruha [8] modified Coleman’s formula in two ways, which yields formulas C1 and C2:

    $Q_{C1} = Q_{Coleman} \times \frac{2 + Q_{Cohen}}{3}$

    $Q_{C2} = Q_{Coleman} \times \frac{1 + cover(R)}{2}$

where the coefficients 2, 3 and 1, 2 are used for normalization.
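
The four agreement-based measures of this subsection can be computed together from the cell counts. The sketch below assumes the formulas as reconstructed from the definitions in the text (Cohen and Coleman as normalized differences between actual and chance agreement, C1 and C2 as rescaled Coleman values); the function name `agreement_measures` and the example counts are illustrative.

```python
def agreement_measures(n_rc: int, n_rnc: int, n_nrc: int, n_nrnc: int) -> dict:
    """Return Cohen, Coleman, C1 and C2 rule quality values for one table."""
    n = n_rc + n_rnc + n_nrc + n_nrnc
    # Relative frequencies of the diagonal cells and of the marginals.
    f_rc = n_rc / n
    f_nrnc = n_nrnc / n
    f_r = (n_rc + n_rnc) / n
    f_c = (n_rc + n_nrc) / n
    f_nr = 1.0 - f_r
    f_nc = 1.0 - f_c

    chance = f_r * f_c + f_nr * f_nc          # agreement expected by chance
    cohen = (f_rc + f_nrnc - chance) / (1.0 - chance)
    coleman = (f_rc - f_r * f_c) / (f_r - f_r * f_c)
    cover = n_rc / (n_rc + n_nrc)             # coverage of the rule

    return {
        "cohen": cohen,
        "coleman": coleman,
        "c1": coleman * (2.0 + cohen) / 3.0,
        "c2": coleman * (1.0 + cover) / 2.0,
    }


# Made-up counts for illustration.
measures = agreement_measures(30, 10, 20, 40)
print(measures)
```

The sketch assumes a non-degenerate table: the rule covers some but not all examples and the class is neither empty nor the whole training set, so that no denominator is zero.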


2.4  Measure  of  Information 

The measure of information is another statistical measurement that can be used
to define rule quality. Given a class $C$, the amount of information necessary to
correctly classify an instance into class $C$ whose prior probability is $P(C)$ is
defined as [12] $-\log P(C)$ [bit], where the log function is of base 2. Now given
a rule $R$, the amount of information we need to correctly classify an instance
into class $C$ is $-\log P(C|R)$ [bit], where $P(C|R)$ is the posterior probability of
$C$ given $R$. Therefore, the amount of information obtained by the rule $R$ is
$-\log P(C) + \log P(C|R)$ [bit]. Kononenko and Bratko [12] call the value of this
formula the information score, which measures the amount of information the
rule $R$ contributes. Using frequencies to estimate the probabilities, the formula
can be written as

    $Q_{IS} = -\log \frac{n_c}{N} + \log \frac{n_{rc}}{n_r}$.
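
A small sketch of the information score, estimating $P(C)$ by $n_c/N$ and $P(C|R)$ by $n_{rc}/n_r$ with base-2 logarithms as stated above. The function name `q_is` and the example counts are illustrative; the sketch assumes $n_{rc} > 0$.

```python
import math


def q_is(n_rc: int, n_r: int, n_c: int, n: int) -> float:
    """Information score in bits: -log2 P(C) + log2 P(C|R)."""
    prior = n_c / n          # estimate of P(C)
    posterior = n_rc / n_r   # estimate of P(C|R)
    return -math.log2(prior) + math.log2(posterior)


# Made-up counts: class prior 50/100, 30 of the 40 covered examples in C.
print(q_is(30, 40, 50, 100))
```

The score is positive when the rule raises the probability of the class above its prior and negative when it lowers it.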


2.5  Measure  of  Logical  Sufficiency 

The logical sufficiency measure is a standard likelihood ratio statistic, which
has been applied to measure rule quality [10,1]. Given a rule $R$ and a class $C$,
the degree of logical sufficiency of $R$ with respect to $C$ is defined by

    $Q_{LS} = \frac{P(R|C)}{P(R|\bar{C})}$

where $P$ denotes probability. A rule for which $Q_{LS}$ is large means that the
observation of $R$ is encouraging for the class $C$; in the extreme case of $Q_{LS}$
approaching infinity, $R$ is sufficient to establish $C$ in a strict logical sense. On
the other hand, if $Q_{LS}$ is much less than unity, then the observation of $R$ is
discouraging for $C$. Using frequencies to estimate the probabilities, the formula
can be expressed as

    $Q_{LS} = \frac{n_{rc} / n_c}{n_{r\bar{c}} / n_{\bar{c}}}$.
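
The frequency estimate of logical sufficiency can be sketched as below. Returning infinity when the rule covers no negative example is an assumption of this sketch, chosen to mirror the limiting case described in the text; the function name `q_ls` and the example counts are illustrative.

```python
import math


def q_ls(n_rc: int, n_rnc: int, n_c: int, n_nc: int) -> float:
    """Logical sufficiency: estimated P(R|C) divided by estimated P(R|not C)."""
    p_r_given_c = n_rc / n_c       # share of class C covered by the rule
    p_r_given_nc = n_rnc / n_nc    # share of the other examples covered
    if p_r_given_nc == 0.0:
        # Assumed convention: a rule covering only positive examples is
        # treated as logically sufficient (undefined if n_rc is also 0).
        return math.inf
    return p_r_given_c / p_r_given_nc


# Made-up counts: rule covers 30 of 50 positives and 10 of 50 negatives.
print(q_ls(30, 10, 50, 50))
```

Values above 1 indicate that observing the rule is encouraging for the class, and values below 1 indicate that it is discouraging.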


2.6  Measure  of  Discrimination 


Another statistical rule quality formula is the measure of discrimination, which
is applied in ELEM2 [2]. The formula was inspired by a query term weighting
formula used in probability-based information retrieval. The formula
measures the extent to which a query term can discriminate between relevant and
non-relevant documents [14]. If we consider a rule $R$ as a query term in an
information retrieval setting, positive examples of a class $C$ as relevant documents,
and negative examples as non-relevant documents, then the following formula
can be used to measure the extent to which the rule $R$ can discriminate between
the positive and negative examples of the class $C$:

    $Q_{MD} = \log \frac{P(R|C)(1 - P(R|\bar{C}))}{P(R|\bar{C})(1 - P(R|C))}$

where $P$ denotes probability. The formula can be estimated using the frequencies as

    $Q_{MD} = \log \frac{n_{rc} / n_{r\bar{c}}}{n_{\bar{r}c} / n_{\bar{r}\bar{c}}}$.
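
A minimal sketch of the frequency estimate, which reduces to the log of the cross-product ratio of the four cells. The text does not state the base of the logarithm, so the natural logarithm is an assumption here, as is raising an error for empty cells; the function name `q_md` and the example counts are illustrative.

```python
import math


def q_md(n_rc: int, n_rnc: int, n_nrc: int, n_nrnc: int) -> float:
    """Measure of discrimination estimated from the four cell counts."""
    if min(n_rc, n_rnc, n_nrc, n_nrnc) == 0:
        # The ratio is undefined or infinite when a cell is empty; how such
        # cases are handled in practice is not described in the text.
        raise ValueError("all four cells must be non-zero")
    # (n_rc / n_rnc) / (n_nrc / n_nrnc), written as a single ratio.
    return math.log((n_rc * n_nrnc) / (n_rnc * n_nrc))


# Made-up counts for illustration.
print(q_md(30, 10, 20, 40))
```

The value is zero when the rule covers positives and negatives in the same proportion, positive when it favors the class, and negative otherwise.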

3 Experiments with Rule Quality Measures

3.1  Experimental  Design 

We evaluate the rule quality formulas described in Section 2 by determining how
different rule quality formulas affect the predictive performance of a rule
induction system, ELEM2. In ELEM2, a rule quality formula is used in both the
post-pruning and classification processes. In post-pruning, removal of an
attribute-value pair depends on whether it will decrease the quality value of the
rule. In classification, the rule quality formula is used to help resolve conflicts
among rules. In our experiments, we run versions of ELEM2, each of which uses a
different rule quality formula. The $\chi^2$ statistic is used in two ways, in both of
which the $\chi^2$ formula is used as the ELEM2 rule quality measure. They differ in
the method used to post-prune a generated rule.

1.  $Q_{\chi^2_{.05}}$: In post-pruning, the removal of an attribute-value pair depends on
whether the rule quality value after removing the attribute-value pair is
greater than $\chi^2_{.05}$, i.e., the tabular $\chi^2$ value for the significance level of 0.05
with one degree of freedom. If the calculated value is greater than the tabular
$\chi^2_{.05}$, then the attribute-value pair is removed; otherwise other pairs are
checked, or post-pruning stops if all pairs have been checked.

2.  $Q_{\chi^2_{.05}+}$: In post-pruning, an attribute-value pair is removed if and only if the
rule quality value $Q_{after}$ after removing the attribute-value pair is greater
than $\chi^2_{.05}$ and $Q_{after}$ is no less than the rule quality value before removing
the attribute-value pair.

The $G^2$ statistic, denoted as $Q_{G^2_{.05}+}$, is used in the same way as $Q_{\chi^2_{.05}+}$, i.e.,
a pair is removed in post-pruning if and only if the value of $Q_{G^2_{.05}+}$ is greater
than $\chi^2_{.05}$ and the removal does not cause the rule quality value to decrease.
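
The two pruning criteria can be written as simple predicates. The sketch assumes the standard tabular value 3.841 for $\chi^2$ at the 0.05 level with one degree of freedom; the function names are illustrative, and the surrounding loop over attribute-value pairs in ELEM2 is not shown because the text does not describe it in detail.

```python
CHI2_05_1DF = 3.841  # tabular chi-square value, significance 0.05, 1 degree of freedom


def prune_threshold_only(q_after: float) -> bool:
    """First variant: remove the pair if the quality after removal exceeds the tabular value."""
    return q_after > CHI2_05_1DF


def prune_threshold_and_no_decrease(q_before: float, q_after: float) -> bool:
    """Second variant ('+'): additionally require that the rule quality does not decrease."""
    return q_after > CHI2_05_1DF and q_after >= q_before


# Illustrative values: removal keeps the rule significant but lowers its quality.
print(prune_threshold_only(16.67), prune_threshold_and_no_decrease(20.0, 16.67))
```

In this example the first variant would remove the pair while the second would keep it, which illustrates why the two variants can produce different pruned rules.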

Our experiments are conducted using 27 benchmark datasets obtained from
the UCI Repository of Machine Learning Databases. The datasets represent a
mixture of characteristics, ranging from 2 to 10 classes, from 4 to 64 condition
attributes, and from 24 to 7491 examples.

3.2  Results 

On each dataset, we conduct a ten-fold evaluation of each rule quality measure
using ELEM2. The mean predictive accuracy on each dataset for each formula is
shown in Figure 1. The average of the accuracy means for each formula over the
27 datasets is shown in Table 3, where the rule quality formulas are listed in
decreasing order of average accuracy mean.

Table 3. Average of accuracy means for each formula over the datasets.

Formula                  Average
$Q_{C2}$                 81.89
$Q_{WS}$                 81.71
$Q_{C1}$                 81.61
$Q_{LS}$                 [illegible]
$Q_{MD}$                 [illegible]
$Q_{Coleman}$            80.65
$Q_{G^2_{.05}+}$         79.94
$Q_{IS}$                 [illegible]
$Q_{Prod}$               79.59
$Q_{\chi^2_{.05}+}$      78.44
$Q_{Cohen}$              78.08
$Q_{\chi^2_{.05}}$       72.42

Whether a formula with a higher average is significantly better than a formula
with a lower average is determined by paired t-tests. The t-test results in terms
of p-values

Rule  Quality  Measures  Improve  the  Accuracy  of  Rule  Induction  125 


Fig. 1. Results on the 27 datasets (legend: MD, Chi.05, Chi.05+, G2.05+, Cohen, Coleman, IS, LS, WS, Prod, C1, C2)


are  reported  in  Table  4.  A  small  p-value  indicates  that  the  null  hypothesis  (the 
difference  between  the  two  formulas  is  due  to  chance)  should  be  rejected  in  favor 
of  the  alternative  at  any  significance  level  above  the  calculated  value.  In  Table  4, 
the  p-values  that  are  smaller  than  0.05  are  shown  in  bold-type  to  indicate  that 
the  formula  with  higher  average  is  significantly  better  than  the  formula  with  the 
lower  average  at  the  5%  significance  level. 

Table  4.  Significance  levels  (p-values  from  paired  t-test)  of  improvement. 


Qws 

Qci 

Qls 

Qmd 

<^G2.05-f 

K2E 

Q  Prod 

imm 

KSi 

Qd 

0.0078 

0.0011 

Qws 

- 

MiEM 

0.0465 

im>Rl:li< 

■munU 

Qc  i 

- 

- 

EEgg 

0.0094 

0.0012 

IiIftjflM 

■«m:n-v 

Qls 

- 

- 

- 

0.0278 

HMKEfcl 

_ Qmd _ 

- 

- 

- 

- 

0.5435 

■iJiltKM 

im»n>H 

- 

- 

- 

- 

- 

NA 

0.0378 

0.0626 

0.3573 

0.0609 

0.0908 

0.0024 

1 'BW 

IBS 

NA 

g  -S3 

0.2295 

Qrs 

- 

- 

- 

- 

- 

- 

- 

NA 

0.8059 

0.2532 

0.2632 

0.0058 

- 

- 

- 

- 

- 

- 

- 

- 

NA 

mm 

- 

- 

■ 

■ 

• 

_ 

* 

- 

0.6246 

0.0067 

Q  d  nh.en 

- 

- 

- 

- 

- 

- 

- 

- 

- 

- 

NA 

nmw:Ml 

wsm 

- 

- 

" 

- 

■ 

“ 

NA 

Generally speaking, we can say that, in terms of predictive performance,
$Q_{C2}$, $Q_{WS}$, $Q_{C1}$, $Q_{LS}$ and $Q_{MD}$ are comparable even if their performance may
not agree on a particular dataset. The same holds for $Q_{Coleman}$, $Q_{G^2_{.05}+}$, $Q_{IS}$ and
$Q_{Prod}$, and for $Q_{\chi^2_{.05}+}$ and $Q_{Cohen}$. The performance of $Q_{G^2_{.05}+}$ and $Q_{IS}$ is not
only comparable, but also similar on each particular dataset (as seen in Figure
1), which indicates that the two formulas have similar trends with regard to
$n_{rc}$, $n_r$, $n_c$ and $N$ in the contingency table.

4 Learning from the Experimental Results

From the experimental results, we posit that, even if on some datasets (such as
the breast cancer dataset) the performance of the learning system is not very
sensitive to the rule quality formula used, the performance greatly depends on
the formula on most of the other datasets. It would be desirable to apply the
“right” formula, i.e., the one that gives the best performance among all formulas
on a particular dataset. For example, even though the formula $Q_{\chi^2_{.05}}$ is not a good
formula in general, it performs better than other formulas on some datasets such
as heart and lenses. If we can find the conditions under which each formula leads
to good performance of the learning system, we can select the “right” formulas
for different datasets and further improve the predictive performance of the
learning system.

To discover such regularities, we use ELEM2 to learn formula-selection rules
from the experimental results shown in the last section. The learning problem is
divided into (1) learning, for each rule quality formula, the rules that describe the
conditions under which the formula produces “very good”, “good”, “medium”
or “bad” results, and (2) combining the rules for all the formulas that describe
the conditions under which the formulas give “very good” results. The resulting
set of rules is the set of formula-selection rules, which can be used by the ELEM2
classification procedure to perform formula selection.

4.1  Data  Representation 

For the purpose of learning formula-behavior rules, i.e., the rules that describe
the conditions under which a formula leads to “very good”, “good”, “medium”,
or “bad” performance, we construct training examples from the above results
and the dataset characteristics. First, on each dataset, we decide the relative
performance of each formula as “very good”, “good”, “medium”, or “bad”. For
example, on the balance-scale dataset, we say that the formulas whose accuracy
mean is above 85% produce “very good” results; the formulas whose accuracy
mean is between 80% and 85% produce “good” results; the ones with a mean
between 75% and 80% are “medium”; and the other formulas give “bad” results.
Then, for each formula, we construct a training data set in which each training
example describes the characteristics of a dataset together with whether the
formula produces a “very good”, “good”, “medium”, or “bad” result
on this dataset. Thus, to learn the rules for each formula, we have 27 training
examples. The characteristics of a dataset are described in terms of the number of
examples, number of attributes, number of classes and the class distribution. A
sample of training examples for learning the behavior rules of the formula $Q_{IS}$
is shown in Table 5.
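
The labelling step can be sketched as a small function. The default cut-offs are the ones quoted above for the balance-scale dataset; since the labels are relative to each dataset, other datasets would presumably use other cut-offs. How values exactly on a boundary are assigned is not stated in the text and is an assumption here, as are the function and parameter names.

```python
def label_performance(accuracy: float, cutoffs=(85.0, 80.0, 75.0)) -> str:
    """Map a mean accuracy (in percent) to a relative performance label."""
    very_good, good, medium = cutoffs
    if accuracy > very_good:
        return "very good"
    if accuracy >= good:      # assumed: boundary values go to the higher label
        return "good"
    if accuracy >= medium:
        return "medium"
    return "bad"


# Illustrative accuracy means, not taken from the reported experiments.
for acc in (86.0, 82.0, 77.0, 60.0):
    print(acc, label_performance(acc))
```

Each training example for a formula would then pair the four dataset characteristics with the label returned for that formula on that dataset.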


4.2  The  Learning  Results 

ELEM2 with its default rule quality formula ($Q_{MD}$) is used to learn the
“behavior” rules from the training dataset constructed for each formula. Table 6
lists some of these behavior rules for each formula, where N stands for the number
of examples, NofA is the number of attributes, NofC is the number of classes,
and “No. of Support Datasets” means the number of the datasets that support
the corresponding rule. These rules summarize the predictive performance of
each formula in terms of characteristics of datasets.

Table 5. Sample of training examples for learning the behavior of a formula

Number of     Number of      Number of     Class
Examples      Attributes     Classes       Distribution     Performance
4177          8              3             Even             Very Good
690           14             2             Even             Medium
625           4              3             Uneven           Bad
683           9              2             Uneven           Medium
1728          6              4             Uneven           Good

Table 6. Formula Behavior Rules

Formula               Condition                                       Decision    Rule Quality   No. of Support Datasets
$Q_{C2}$              (768<N<1728)                                    Very good   1.30           4
                      (N<653)and(NofA>10)and(NofC<7)                  Good        1.36           5
$Q_{WS}$              (625<N<1728)and(NofA>8)and(ClassDistr!=Even)    Very good   1.48           4
                      (N>336)and(NofC>5)                              Good        1.38           4
$Q_{C1}$              (N>270)and(8<NofA<15)                           Very good   1.66           5
                      (15<NofA<57)                                    Good        1.43           7
$Q_{LS}$              (N>2310)                                        Very good   1.45           5
                      (N<87)                                          Bad         2.41           2
$Q_{MD}$              (N>768)and(8<NofA<16)                           Very good   2.04           3
                      (351<N<4601)and(NofA>13)                        Good        1.23           6
$Q_{Coleman}$         (N>958)and(NofC<5)                              Very good   1.79           5
                      (N<87)                                          Bad         1.51           2
$Q_{G^2_{.05}+}$      (N>101)and(10<NofA<18)and(NofC>2)               Very good   2.04           3
                      (270<N<690)and(NofA<15)                         Medium      2.25           6
$Q_{IS}$              (N>150)and(NofC>2)and(ClassDistr=Even)          Very good   [illegible]    4
                      (N<101)                                         Bad         [illegible]    3
$Q_{Prod}$            (N<214)and(NofA>7)and(NofC<6)                   Very good   1.80           3
                      (N>768)and(8<NofA<57)                           Medium      1.70           6
$Q_{\chi^2_{.05}+}$   (N<178)and(NofA>9)                              Very good   1.67           2
                      (N<214)and(4<NofA<9)                            Bad         1.39           3
$Q_{Cohen}$           (345<N<1484)and(NofA<8)                         Very good   [illegible]    3
                      (4<NofA<6)                                      Bad         [illegible]    3
$Q_{\chi^2_{.05}}$    (9<NofA<14)and(NofC<2)                          Very good   1.91           [illegible]
                      (N>24)                                          Bad         0.98           20

We further build a set of formula-selection rules by combining all the “very
good” rules, i.e., the rules that predict “very good” performance for each formula,
and use them to select a “right” formula for a (new) dataset. For formula
selection, we can use the ELEM2 classification procedure, which takes the
formula-selection rules and classifies a dataset into the class of using a
particular formula.
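
The selection step might be sketched as follows, using four of the “very good” rules that are legible in Table 6. Several things here are assumptions rather than the authors' procedure: the comparison operators are taken as printed (the extraction may have lost some ≤ or ≥ signs), conflicts between matching rules are resolved by simply taking the rule with the highest rule quality, which simplifies whatever the ELEM2 classification procedure actually does, and the fallback to $Q_{MD}$ when no rule fires is a hypothetical default. All names in the code are illustrative.

```python
# (formula, rule quality, condition) triples transcribed from legible rows of Table 6.
VERY_GOOD_RULES = [
    ("Q_WS", 1.48, lambda d: 625 < d["N"] < 1728 and d["NofA"] > 8
                             and d["ClassDistr"] != "Even"),
    ("Q_C1", 1.66, lambda d: d["N"] > 270 and 8 < d["NofA"] < 15),
    ("Q_LS", 1.45, lambda d: d["N"] > 2310),
    ("Q_MD", 2.04, lambda d: d["N"] > 768 and 8 < d["NofA"] < 16),
]


def select_formula(dataset: dict, default: str = "Q_MD") -> str:
    """Pick a rule quality formula from dataset characteristics.

    Assumed conflict resolution: among all matching rules, return the
    formula whose rule has the highest rule quality value.
    """
    matches = [(quality, name)
               for name, quality, condition in VERY_GOOD_RULES
               if condition(dataset)]
    if not matches:
        return default  # hypothetical fallback when no rule fires
    return max(matches)[1]


# Hypothetical dataset descriptions, not taken from the 27 benchmark datasets.
print(select_formula({"N": 1000, "NofA": 10, "NofC": 2, "ClassDistr": "Uneven"}))
print(select_formula({"N": 300, "NofA": 12, "NofC": 3, "ClassDistr": "Even"}))
```

For the first description three rules fire and the one with the highest quality wins; for the second only the $Q_{C1}$ rule fires.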

4.3  ELEM2  with  Multiple  Rule  Quality  Formulas 

With formula-selection rules, ELEM2 has the flexibility of using different
formulas on different datasets. To see how this strategy works, we conduct a
ten-fold evaluation of the “flexible” ELEM2 on the 27 datasets we used before.
The result is shown in Figure 2, in which the average accuracy mean from the
“flexible” ELEM2 (labeled as “Combine” in the graph) is compared with the ones
from using individual formulas. We also conduct paired t-tests to see how much
the flexible ELEM2 improves over ELEM2 with a single rule quality formula.
The p-values from the t-tests are shown in Table 7. We can see that “Combine”
improves significantly on every single formula.

Fig. 2. Average of accuracy means of each formula on the 22 datasets

Table  7.  Significance  levels  of  the  improvement  of  “Combine”  over  individual  formulas 


QC2 

Qws 

Qc  i 

Qmd 

Q  Coleman 

Q<3'2.os+ 

Qis 

ESS! 

KfflH 

p- value 

0.0139 

0.0009 

0.0154 

rotund 

0.0006 

0.0002 

fitiTiIiM 

°.  1 

|  0.0001  1 

|  0.0005 

LXJ 

5  Conclusions 

We have described and experimented with various statistical and empirical
formulas for defining rule quality measures. All formulas are applicable to a rule
induction system for the purpose of post-pruning and classification, but their
performance varies among the datasets. The empirical formulas, especially $Q_{WS}$,
work very well even though they are not backed by statistical theories. Among the
statistical formulas, $Q_{C2}$, $Q_{C1}$, $Q_{LS}$ and $Q_{MD}$ work the best on the tested datasets
and are comparable with $Q_{WS}$.

To determine the regularity of a rule quality formula’s performance in
terms of dataset characteristics, we used our learning system to induce
formula-behavior rules from a dataset constructed from the experimental results for
different formulas. These rules indicate the situations in which a
formula leads to very good, good, medium or bad performance. These rules were
also combined and used to automatically select a rule quality formula before
rule induction begins. Our experiment showed that this selection of rule quality
formula can lead to significant improvement over the rule induction system using
a single rule quality formula. Future work includes testing our conclusions on
more datasets to obtain more reliable formula-behavior rules. With more datasets
available, we will test the formula-selection rules on datasets that are
different from the datasets used for generating the rules.

Abstract The emergence of World Wide Web (Web) technology and the
advance of data capturing techniques have led to exponential growth in the
amount of data being stored in Web server logs. This growth in turn has
motivated researchers to seek new techniques for the extraction of knowledge
implicit or hidden in such data. Designing a web site is a complex problem.
Web server logs provide an opportunity to observe users interacting with the
site and make improvements to that site’s structure and presentation. In this
paper, we motivate the need for a dynamic data mining approach for mining
user access patterns that uses mining results from previous time
periods. We present an efficient approach that uses the latest data mining results
and new changes in Web server logs to generate new mining rules. The
proposed approach is shown to be effective for solving problems related to the
efficiency of handling data updates and the accuracy of data mining results. The
proposed approach does not depend on the technique used to generate new
frequent user access patterns during the current episode (time period). In our
analysis, we have used an Apriori-like algorithm as a local algorithm to
generate frequent user access patterns. The experimental results show that,
compared to Apriori-like techniques, our dynamic approach improves the
efficiency of the mining process.

Keywords:  Knowledge  Discovery,  Data  Mining,  Web  Mining,  User  Access  Patterns, 
Association  Mining,  Web  Structure. 

1  Introduction

With the growing popularity of the World-Wide Web (Web) and the rapid progress
of Web technology, hundreds of millions of transactions are processed every day
through the Web. Web servers keep log entries (files) for all transactions that
access their sites, and the sizes of those log files are increasing by tens of
megabytes every day. Server logs reveal an enormous amount of information about
users, server behavior, changes in sites, and potential benefits of new
technical developments. Most institutions have not been able to make effective
use of Web server log files for improving server performance and design.
Mining information and knowledge from Web transaction data has become a
prominent and important research and application area.

The  behavior  of  user  access  patterns  can  be  detected  by  using  the  history 
contained  in  Web  server  log  files  [10,11,12,13].  Analyzing  and  capturing  similarities 
in  this  behavior  can  enhance  system  performance  and  identify  user  interests.  Many 
studies  have  been  conducted  to  understand  user  motivation  and  reaction,  analyze 
system  performance,  and  improve  system  design  [5,13].  Applying  data  mining 
techniques  on  Web  logs  discovers  interesting  access  patterns  that  can  be  used  to 
restructure  server  sites  in  an  efficient  way.  Unfortunately  most  of  the  existing  data 
mining  techniques  are  iterative  and  require  many  disk  scans  over  transaction  (log) 
files  [1,3,8]. 

Web applications require up-to-date mining of information from data that
changes on a regular basis [7]. Thousands of remote sites (URLs) are created
and removed daily. In such an environment, frequent or occasional updates may
change the status of some interesting patterns discovered earlier [13].
Discovering knowledge is an expensive operation [5,6]. It requires extensive
access to secondary storage, which can become a bottleneck for efficient
processing. Running data mining algorithms from scratch each time the data
changes is not an efficient strategy. Updating previously discovered knowledge
could solve many problems that data mining techniques have faced for years,
namely the inability to handle data updates, lack of accuracy of data mining
results, and poor performance.

Association mining, which discovers dependencies among the values of an
attribute, was introduced by Agrawal et al. [1] and has emerged as an important
research area. The problem of association mining, also referred to as the
market basket problem, is formally defined as follows. Let
$I = \{i_1, i_2, \ldots, i_n\}$ be a set of items and
$S = \{s_1, s_2, \ldots, s_m\}$ be a set of transactions, where each
transaction $s_i \in S$ is a set of items, that is, $s_i \subseteq I$. An
association rule, denoted by $X \Rightarrow Y$ with $X, Y \subset I$ and
$X \cap Y = \emptyset$, describes the existence of a relationship between the
two itemsets X and Y.

Several measures have been introduced to define the strength of the
relationship between itemsets X and Y, such as support, confidence, and
interest. The definitions of these measures, from a probabilistic model, are
given below.

I. Support$(X \Rightarrow Y) = P(X, Y)$, or the percentage of transactions in
the database that contain both X and Y.

II. Confidence$(X \Rightarrow Y) = P(X, Y) / P(X)$, or the percentage of
transactions containing Y among those transactions containing X.

III. Interest$(X \Rightarrow Y) = P(X, Y) / (P(X) P(Y))$ represents a test of
statistical independence.
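
The three measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the basket data and the function names `support`, `confidence` and `interest` are hypothetical, and probabilities are estimated as simple relative frequencies over the transaction set.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(X, Y) / P(X)."""
    return support(transactions, x | y) / support(transactions, x)


def interest(transactions, x, y):
    """Estimate of P(X, Y) / (P(X) P(Y)); a value of 1 suggests independence."""
    joint = support(transactions, x | y)
    return joint / (support(transactions, x) * support(transactions, y))


# Hypothetical market-basket data, not taken from the paper.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
x, y = {"bread"}, {"milk"}
print(support(baskets, x | y))
print(confidence(baskets, x, y))
print(interest(baskets, x, y))
```

An interest value below 1, as in this toy data, would indicate that the two itemsets occur together less often than independence predicts.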

Agrawal et al. [2] introduced the problem of mining sequential patterns over such
databases. Two algorithms, AprioriSome and Apriori-Like [2], have been presented to


132  A.  Hafez 


solve  this  problem,  and  their  performances  have  been  evaluated  using  synthetic  data. 
The  two  algorithms  have  comparable  performances.  AprioriSome  has  performed 
better  when  the  minimum  number  of  users  required  to  deem  a  sequential  pattern  to  be 
interesting  is  low. 

In  this  paper,  we  propose  an  approach  that  dynamically  updates  knowledge 
obtained  from  the  data  mining  process  during  previous  time  periods.  Transactions 
over  a  long  duration  are  divided  into  a  set  of  consecutive  episodes.  We  propose  a 
modified  structure  for  keeping  updated  log  transactions.  The  proposed  structure 
facilitates the use of different association mining techniques. Our approach discovers
current  frequent  user  access  patterns  by  using  updates  that  have  occurred  during  the 
current  time  period  along  with  the  frequent  user  access  patterns  that  have  been 
discovered  in  the  previous  time  period. 

In  section  2,  we  give  the  formal  definition  of  the  problem  of  discovering  frequent 
user  access  patterns.  The  proposed  structure  of  Web  transaction  log  and  the  dynamic 
approach  are  described  in  section  3.  Our  experimental  results  are  presented  in  section 
4.  The  experimental  results  are  discussed  and  the  paper  is  concluded  in  section  5. 

2  Problem  Definition 

In  the  original  Web  log  file  OF,  each  request  received  by  the  Web  server  creates  a 
Web  log  entry  e  that  contains  three  components:  User(e)  denotes  the  user-id  of  that 
user  who  originated  the  request,  Time(e)  is  the  time-stamp  of  that  request,  and  url(e) 
is  the  set  of  requested  URLs  [2,4,10].  Examples  2.1  and  2.2  demonstrate,  for  a  given 
Web  server,  the  original  log  file  OF  and  the  current  log  file  CF. 

Example 2.1 The original Web log file OF

e    User(e)   Time(e)   url(e)
1    1         4         {a,b,c,d}
2    2         6         {a,c}
3    1         8         {a,b,d}
4    2         10        {a,c,f}
5    3         14        {c}
6    2         16        {a,f}
7    3         18        {c,f}
8    4         20        {a}

Example 2.2 The current Web log file CF

e    User(e)   Time(e)   url(e)
1    1         4         {a,b,c,d}
2    2         6         {a,c}
3    1         8         {a,b,d}
4    2         10        {a,c,f}
5    3         14        {c}
6    2         16        {a,f}
7    3         18        {c,f}
8    4         20        {a}
9    3         24        {c,d}
10   5         24        {a,d}
11   1         26        {a,c}
12   2         30        {a,c,f}
13   5         32        {a,c}
14   3         36        {a,b,c}
15   6         36        {b,c}
16   5         40        {b,c}

We adopt the same definitions used in [2] to define the terms sequential
pattern, support, confidence, and frequent k-sequence.

sequential pattern     is defined as a set of one or more URLs that are
                       accessed sequentially.
support(X)             is defined as the ratio of users who have requested
                       sequential pattern X.
confidence(X => Y)     is defined as the ratio of users who have requested
                       sequential patterns X and Y among users who have
                       requested sequential pattern X.
frequent k-sequence    is defined as a set of k URLs that are accessed
                       sequentially and has support greater than or equal to
                       a support threshold minsup.


Example 2.3 Let minsup = 0.5. The frequent k-sequences derived from Web log
file OF, defined in Example 2.1, are

Frequent 1-sequences
  a    {(1,4),(2,6),(1,8),(2,10),(2,16),(4,20)}    support(a) = 0.75
  c    {(1,4),(2,6),(2,10),(3,14),(3,18)}          support(c) = 0.75
  f    {(2,10),(2,16),(3,18)}                      support(f) = 0.5
Frequent 2-sequences
  ac   {(1,4),(2,6),(2,10)}                        support(ac) = 0.5
  cf   {(2,10),(3,18)}                             support(cf) = 0.5

Example 2.4 Let minsup = 0.5. The frequent k-sequences derived from Web log
file CF, defined in Example 2.2, are

Frequent 1-sequences
  a    {(1,4),(2,6),(1,8),(2,10),(2,16),(4,20),(5,24),(1,26),(2,30),(5,32),(3,36)}
       support(a) = 0.833
  b    {(1,4),(1,8),(3,36),(6,36),(5,40)}
       support(b) = 0.75
  c    {(1,4),(2,6),(2,10),(3,14),(3,18),(3,24),(1,26),(2,30),(5,32),(3,36),(6,36),(5,40)}
       support(c) = 0.833
  d    {(1,4),(1,8),(3,24),(5,24)}
       support(d) = 0.5
Frequent 2-sequences
  ac   {(1,4),(2,6),(2,10),(1,26),(2,30),(5,32),(3,36)}
       support(ac) = 0.75
  bc   {(1,4),(3,36),(6,36),(5,40)}
       support(bc) = 0.75

3  The  Dynamic  Approach 

Knowledge discovery of patterns is defined as locating those patterns in which
accesses to different resources consistently occur together, or accesses from a
particular place occur at regular times [4,11,12]. In our approach, we define a
structure for keeping log transactions. Rather than describing log entries with
respect to their entry order, we map the original structure of Web log files
into an equivalent structure where, for each URL, there exists a set ID(URL)
such that each element in ID(URL) is a pair <user-id, time-stamp>. Formally
speaking, for a given log file F and a Web page URL,
$ID(URL) = \{(User(e), Time(e)) \mid e \in F,\ URL \in url(e)\}$. In Examples
3.1 and 3.2, we demonstrate, for a given Web server, the proposed mappings of
the original log file OF and the current log file CF, respectively.

Example 3.1 The mapping of the original Web log file OF defined in Example 2.1 is

URL   ID(URL)
a     {(1,4),(2,6),(1,8),(2,10),(2,16),(4,20)}
b     {(1,4),(1,8)}
c     {(1,4),(2,6),(2,10),(3,14),(3,18)}
d     {(1,4),(1,8)}
f     {(2,10),(2,16),(3,18)}


Example 3.2 The mapping of the current Web log file CF defined in Example 2.2 is

URL   ID(URL)
a     {(1,4),(2,6),(1,8),(2,10),(2,16),(4,20),(5,24),(1,26),(2,30),(5,32),(3,36)}
b     {(1,4),(1,8),(3,36),(6,36),(5,40)}
c     {(1,4),(2,6),(2,10),(3,14),(3,18),(3,24),(1,26),(2,30),(5,32),(3,36),(6,36),(5,40)}
d     {(1,4),(1,8),(3,24),(5,24)}
f     {(2,10),(2,16),(3,18),(2,30)}
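
The mapping from log entries to ID(URL) sets can be sketched as follows, using the log file OF of Example 2.1. Two points are assumptions inferred from Examples 2.3 and 3.1 rather than stated definitions: the ID set of a k-sequence is taken to be the intersection of the ID sets of its URLs, and support is taken to be the number of distinct users in that set divided by the number of users in the log. The function names are hypothetical.

```python
from collections import defaultdict

# Original log file OF (Example 2.1): (user-id, time-stamp, requested URLs).
OF = [
    (1, 4, {"a", "b", "c", "d"}),
    (2, 6, {"a", "c"}),
    (1, 8, {"a", "b", "d"}),
    (2, 10, {"a", "c", "f"}),
    (3, 14, {"c"}),
    (2, 16, {"a", "f"}),
    (3, 18, {"c", "f"}),
    (4, 20, {"a"}),
]


def build_id_map(log):
    """Map each URL to its set of (user-id, time-stamp) pairs."""
    ids = defaultdict(set)
    for user, time, urls in log:
        for url in urls:
            ids[url].add((user, time))
    return dict(ids)


def sequence_ids(id_map, sequence):
    """Pairs shared by every URL of a non-empty sequence (assumed: intersection)."""
    sets = [id_map.get(url, set()) for url in sequence]
    return set.intersection(*sets)


def support(id_map, sequence, n_users):
    """Distinct users that requested the sequence, divided by all users."""
    users = {user for user, _ in sequence_ids(id_map, sequence)}
    return len(users) / n_users


id_map = build_id_map(OF)
n_users = len({user for user, _, _ in OF})
print(sorted(id_map["f"]))
for seq in ("a", "c", "f", "ac", "cf"):
    print(seq, support(id_map, seq, n_users))
```

Because the mapped structure is keyed by URL, the support of a candidate sequence is obtained from set operations on a few ID sets instead of a scan over every log entry.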


Web log files keep information about all accesses, including accesses to Web
pages that were removed a long time ago. Mining algorithms should keep a list
of the removed pages so that accesses to them are not counted, but the whole
log file must still be scanned. In the context of Web mining [9,10,11,12], it
is a better strategy to use the mining results collected in the last mining
session and to apply the mining procedure only to the transactions added to the
Web log since then. In this paper, we propose a dynamic algorithm for mining
user access patterns that treats Web log transactions as sequences over periods
of time and uses the association rules discovered in the previous session to
improve the efficiency of the mining process.

The storage requirement for keeping the structure of transaction updates is
$\alpha N$, where N is the number of disk blocks needed to store the
transaction updates, and $\alpha$ is the reduction factor caused by grouping
user IDs of the same Web page.

In  this  section,  we  introduce  the  notions  of  continuous  pattern,  non-continuous 
pattern,  uprising  pattern,  and  non-uprising  pattern. 

Definition 3.1 A sequence X is a continuous pattern through two time periods
$T_1$ and $T_2$ if

$support_{T_1}(X) \ge minsup$ and $support_{T_2}(X) \ge minsup$

Definition 3.2 A sequence X is a non-continuous pattern through two time
periods $T_1$ and $T_2$ if

$support_{T_1}(X) \ge minsup$ and $support_{T_2}(X) < minsup$

Definition 3.3 A sequence X is an uprising pattern through two time periods
$T_1$ and $T_2$ if

$support_{T_1}(X) < minsup$ and $support_{T_2}(X) \ge minsup$

Definition 3.4 A sequence X is a non-uprising pattern through two time periods
$T_1$ and $T_2$ if

$support_{T_1}(X) < minsup$ and $support_{T_2}(X) < minsup$
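
Definitions 3.1 to 3.4 amount to a four-way case split on the two supports. A minimal sketch, assuming that "frequent" means support greater than or equal to minsup (as in the definition of a frequent k-sequence) and using a hypothetical function name:

```python
def classify_pattern(support_t1, support_t2, minsup):
    """Classify a sequence by its support in two consecutive time periods."""
    frequent_t1 = support_t1 >= minsup
    frequent_t2 = support_t2 >= minsup
    if frequent_t1 and frequent_t2:
        return "continuous"
    if frequent_t1:
        return "non-continuous"
    if frequent_t2:
        return "uprising"
    return "non-uprising"


print(classify_pattern(0.75, 0.75, 0.5))
print(classify_pattern(0.50, 0.25, 0.5))
print(classify_pattern(0.25, 0.50, 0.5))
print(classify_pattern(0.25, 0.25, 0.5))
```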

In order to minimize the number of disk scans and keep only the necessary
information, we only consider Web log changes in time period $T_i$ along with
the results obtained during time period $T_{i-1}$, for i = 2, 3, .... Time
values and repeated user-ids are omitted from k-sequences. A new parameter
$disp_{T_i}(X)$ is defined to reflect the displacement of the k-sequence X in
the time period $T_i$.

Definition 3.5 Let X be a k-sequence of URLs and ID(X) be a set of pairs
$(u_j, t_j)$, j = 1, 2, ..., D, with $\tau_{i-1} < t_j \le \tau_i$ and
$\tau_i = \tau_{i-1} + T_i$, where $u_j$ and $t_j$ are a user-id and its
time-stamp, respectively. Displacement of X in time period $T_i$ is defined as

$$disp_{T_i}(X) = \frac{1}{D} \sum_{j=1}^{D} t_j(X), \qquad \tau_{i-1} < t_j(X) \le \tau_i$$
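
As a worked illustration, the displacement can be computed as the mean time-stamp of the pairs that fall inside a period. The averaging over the D pairs is an assumption here; it is inferred from the values listed in Example 3.3 (for instance 64/6 for URL a in the first period). The period boundaries 0, 20 and 40 and the function name are likewise hypothetical.

```python
def displacement(pairs, start, end):
    """Mean time-stamp of the (user-id, time-stamp) pairs with start < t <= end."""
    times = [t for _, t in pairs if start < t <= end]
    if not times:
        return None  # the sequence does not occur in this period
    return sum(times) / len(times)


# ID(a) from Example 3.2.
id_a = {(1, 4), (2, 6), (1, 8), (2, 10), (2, 16), (4, 20),
        (5, 24), (1, 26), (2, 30), (5, 32), (3, 36)}

print(displacement(id_a, 0, 20))   # first period: 64 / 6
print(displacement(id_a, 20, 40))  # second period: 148 / 5
```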

Example 3.3 Let minsup = 0.5. The frequent k-sequences derived from transaction
file OF during period $T_1$ and those updated transactions in transaction file
CF during period $T_2$, defined in Examples 3.1 and 3.2, respectively, are

Period T1
sequence   users          support              disp
a          {1,2,4}        support(a) = 0.75    10.66
c          {1,2,3}        support(c) = 0.75    10.4
f          {2,3}          support(f) = 0.5     14.66
ac         {1,2}          support(ac) = 0.5    6.66
cf         {2,3}          support(cf) = 0.5    14

Period T2
sequence   users          support              disp
a          {1,2,3,5}      support(a) = 0.75    29.6
b          {3,5,6}        support(b) = 0.5     37.33
c          {1,2,3,5,6}    support(c) = 0.833   32
ac         {1,2,3,5}      support(ac) = 0.75   31
bc         {3,5,6}        support(bc) = 0.5    37.33

We consider only those patterns defined in Definitions 3.1 to 3.3, which are
divided into two categories:

Category 1 continuous patterns and uprising patterns.

Category 2 non-continuous patterns.

Patterns in category 1 are automatically included as frequent k-sequences.
Non-continuous patterns (in category 2) are considered frequent k-sequences if

$$\delta_{T_i}(X)\, support_{T_{i-1}}(X) + (1 - \delta_{T_i}(X))\, support_{T_i}(X) \ge minsup, \quad \text{where } 0 < \delta_{T_i}(X) < 1 \qquad (1)$$

In our experimental work, we choose the value of $\delta_{T_i}(X)$ to be
dependent on the behavior of k-sequence X through time periods $T_{i-1}$ and
$T_i$, where

ST(X)  =  <’‘-#VtJX» 

'  (  dispT  (  X  )-dispT  (  X  )) 

Algorithm DynamicApriori

f1+(T2) = {frequent 1-sequences};   // continuous and uprising frequent 1-sequences
f1-(T2) = {non-continuous 1-sequences};
f1(T2) = f1+(T2) ∪ {x | x ∈ f1(T1) ∩ f1-(T2) and
         δT2(x) supportT1(x) + (1 - δT2(x)) supportT2(x) ≥ minsup};
for (k = 2; fk-1(T2) ≠ ∅; k++) do
begin
    Ck = AprioriGen(fk-1(T2));
    forall transactions t ∈ DT2 do
        forall candidates c ∈ Ck do
            if c ⊆ t then c.count++;
    fk+(T2) = {c ∈ Ck | c.count ≥ minsup};
    fk-(T2) = {c ∈ Ck | c ∈ fk(T1) and c.count < minsup};
    fk(T2) = fk+(T2) ∪ {x | x ∈ fk(T1) ∩ fk-(T2) and
             δT2(x) supportT1(x) + (1 - δT2(x)) supportT2(x) ≥ minsup};
end;
return ∪k fk(T2);

function AprioriGen(fk-1)
    insert into Ck
    select l1, l2, ..., lk-1, ck-1
    from fk-1 l, fk-1 c
    where l1 = c1 ∧ l2 = c2 ∧ ... ∧ lk-2 = ck-2 ∧ lk-1 < ck-1;
    delete all items c ∈ Ck such that some (k-1)-subset of c is not in fk-1;
    return Ck;

Figure 3.1 The DynamicApriori Algorithm
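
The candidate-generation step AprioriGen follows the usual Apriori join-and-prune scheme, which can be sketched in Python as below. This is an illustrative sketch rather than the author's implementation: sequences are represented as sorted tuples of URL names, and the function name `apriori_gen` is hypothetical.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Build candidate k-sequences from frequent (k-1)-sequences.

    `frequent_prev` is a collection of sorted tuples, all of length k-1.
    """
    frequent_prev = set(frequent_prev)
    if not frequent_prev:
        return set()
    k = len(next(iter(frequent_prev))) + 1

    # Join step: merge two sequences that share their first k-2 items.
    candidates = set()
    for left in frequent_prev:
        for right in frequent_prev:
            if left[:-1] == right[:-1] and left[-1] < right[-1]:
                candidates.add(left + (right[-1],))

    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    return {
        cand for cand in candidates
        if all(sub in frequent_prev for sub in combinations(cand, k - 1))
    }


# Frequent 1-sequences of Example 2.3 give three candidate 2-sequences.
print(sorted(apriori_gen({("a",), ("c",), ("f",)})))
# Frequent 2-sequences of Example 2.3 share no prefix, so no candidates remain.
print(sorted(apriori_gen({("a", "c"), ("c", "f")})))
```

The loop in Figure 3.1 stops as soon as this function returns no frequent sequences for the next level, which is why the number of passes is bounded by the size of the largest frequent pattern.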

As mentioned before, the dynamic approach can use any data mining technique as
a local technique to generate frequent user access patterns. In this paper, we
demonstrate our dynamic approach using the Apriori-Like algorithm. The
Apriori-Like algorithm is slightly modified to reflect the new factors needed
to perform the dynamic mining. As shown in Figure 3.1, the DynamicApriori
algorithm follows the main outline of the Apriori-Like algorithm. The
DynamicApriori algorithm is decomposed into two modules:

•  Using previous frequent sequences, all sequences that satisfy inequality (1) are generated.

•  From those itemsets generated in the first module, generate all association rules that
satisfy a given minconf value.

In the DynamicApriori algorithm, the number of disk accesses required is

$m \alpha N$

where N is the number of disk blocks needed to store the transaction updates,
$\alpha$ is the reduction factor caused by grouping user IDs of the same Web
page, and m is the size of the maximal frequent user pattern.

4  Performance  Results 

The DynamicApriori algorithm has been tested and its performance compared with
that of the Apriori-Like algorithm, using the following assumptions:

•  Total time length is one year.

•  minsup values are uniformly distributed over the range 0.05 to 0.2.

•  User inter-arrival time is exponentially distributed with means of 1 minute and 5 minutes.

•  Users and URLs (Web pages) are normally distributed (generated from uniform distributions
with means 10000 users and 100 URLs, and 50000 users and 250 URLs).

•  Number of URLs per user is uniformly distributed with mean 20.

In the DynamicApriori algorithm, the one-year time interval is divided into
equal time periods. Five period sizes (1, 2, 3, 4 and 6 months) have been
considered.

In our experiments, we compare the number of disk accesses of the
DynamicApriori algorithm and the Apriori-Like algorithm. Both algorithms use
the same time interval. For example, for a 1-month time period, both algorithms
have been executed 12 times, and the accumulated results are compared. The
frequent sequences generated by the two algorithms are also compared: we
calculate the ratios between the number of frequent sequences generated by both
algorithms and the number of frequent sequences generated by each of the two
algorithms. Figures 4.1 and 4.2 give the results of our experiments. In figure
4.1, the number of disk accesses needed by the Apriori-Like algorithm is
compared to that of the DynamicApriori algorithm. We have found that, for small
time periods, the difference between the two algorithms is large (for a 1-month
period, almost 120 times), which is expected given the following two factors:

•  The dynamic approach uses only the transaction updates, not the whole transaction file.

•  The size of the transaction (mapped) file is reduced after eliminating time values and
duplicate user-ids.


Figure 4.1 Performance Evaluation of PeriodicApriori Algorithm (horizontal axis: Period Size, from 6 months down to 1 month)

Figure 4.2 Matched Frequent Sequences (horizontal axis: Time Periods)

In figure 4.2, we compare the ratio of matched frequent sequences, i.e., those
found by both algorithms. For time periods greater than or equal to 1 month,
the results have shown that the DynamicApriori algorithm has generated most of
the frequent sequences that have been generated by the Apriori-Like algorithm.
For those frequent sequences that have been generated by the DynamicApriori
algorithm and not by the Apriori-Like algorithm, and vice versa, we have
carefully studied their behavior and found that our approach missed only those
frequent sequences with a low time span.

5  Discussion  and  Conclusions 

In this paper, we have introduced a dynamic approach for knowledge discovery of
Web access patterns. In this approach, the time space is divided into time
periods, and the association mining procedure is applied to one time interval
only, using the association rules discovered in the previous time interval. To
demonstrate our dynamic approach, we have used the Apriori-Like algorithm as a
local algorithm to generate frequent user patterns during time periods. A set
of experiments has been performed, and the results of the DynamicApriori
algorithm and the Apriori-Like algorithm are compared. Although we have used
synthetic data to run our experiments, we have carefully chosen distribution
functions that reflect the behavior of users and Web pages.

In our experimental work, we choose the value of δ(X) to be dependent on the
behavior of k-sequence X through the different time periods. We believe that by
applying different techniques to choose the values of δ(X), we may further
improve the performance of our approach.

The experimental results have shown that the DynamicApriori algorithm has
efficiently generated frequent sequences, which are used to generate
association rules. Depending on the time interval length, the ratio between the
number of disk blocks accessed by the two algorithms ranged between 8.0357 and
112.069, in favor of the DynamicApriori algorithm. For a reasonable time
interval (greater than or equal to 1 month), the DynamicApriori algorithm has
generated most of the frequent sequences that have been generated by the
Apriori-Like algorithm. For those frequent sequences that have been generated
by the DynamicApriori algorithm and not by the Apriori-Like algorithm, and vice
versa, we have carefully studied their behavior and found that our approach
missed only those frequent sequences with a low time span. These results favor
our approach and show that our algorithm not only produces frequent sequences
but also implicitly performs time analysis on the discovered frequent
sequences.

Abstract. This paper introduces a new method for instance selection. The
conceptual framework and the basic notions used by this method are those of an
extended rough set theory, called α-rough set theory. In this context we
formalize a notion of conflicting data, which is at the basis of a conflict
normalization method used for instance selection. Extensive experiments are
performed to show the efficiency and the accuracy of models built from the
reduced datasets. The selection methodology and its results are discussed.


1  Introduction 

One way to achieve efficient processing of very large data is to reduce the
number of input data without losing the main information and without decreasing
the quality of the extracted knowledge. To deal with the problem of data
reduction, many methods have been proposed. Generally, two main approaches are
distinguished: statistical sampling techniques and clustering or prototyping
approaches. For instance, Quinlan [10] used a windowing approach in ID3 to
learn on subsets of tuples; Catlett [3] considered windowing in C4.5, using
stratification according to the decision attribute; John and Langley [7]
discuss static versus dynamic sampling; Toivonen [12] and Zaki et al. [14]
examine applications of random sampling for finding association rules; Reinartz
[11] reuses a variant of the leader clustering algorithm of Hartigan [6] and
proposes a similarity-driven sampling approach based on two steps: sorting and
stratification. Dubes et al. [4] discuss clustering methodologies, whereas
Zhang [15] proposes a data summarization algorithm that uses a single-scan
incremental process to create a hierarchical tree of sub-clusters summarizing
the original dataset.


2  Data  Analysis  and  Rough  Sets 

2.1  Rough  Sets  Overview 

Rough Sets Theory (RST) is an extension of set theory. It was introduced by
Z. Pawlak [9] in 1982 to offer a framework for handling imperfect data. It is
a mathematical tool that deals with vagueness and uncertainty. There has been a
fast-growing interest in rough set theory, which has proved to be very useful
in practice. Successful applications have been developed in medicine, decision
analysis, banking, market research, knowledge discovery, and so on. Before
presenting our investigations, we will first review the main concepts of rough
set theory.

Information system: In rough set theory, an information system has a data
table form. Formally, an information system S is a 4-tuple $S = (U, Q, V, g)$,
where: U is a finite set of objects; Q is a finite set of attributes;
$V = \bigcup_{q \in Q} V_q$, where $V_q$ is the domain of attribute q; g is an
information function assigning a value of attribute for every object and every
attribute, i.e., $g : U \times Q \to V$, such that for every $x \in U$ and for
every $q \in Q$, $g(x, q) \in V_q$.

Indiscernibility relation: Let K be a subset of attributes. The
indiscernibility relation, denoted $I_K$ ($\subseteq U \times U$), is an
equivalence relation defined as follows:

$$x\, I_K\, y \iff \forall p \in K,\ x_p = y_p$$

Consequently, x is related to y if, and only if, they have the same value for
all attributes in K. The pair $(U, I_K)$ is called a Pawlak approximation
space. The relation $I_K$ is an equivalence relation, which partitions the
space U into disjoint subsets. The quotient set $U / I_K$ consists of the
equivalence classes of $I_K$, also called elementary sets.
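
The partition into elementary sets can be sketched by grouping objects on their attribute values. The small table, its attribute names and the function name `elementary_sets` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def elementary_sets(table, attributes):
    """Group objects that have equal values on every attribute in `attributes`."""
    classes = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj)
    return list(classes.values())


# Hypothetical information system: object -> {attribute: value}.
table = {
    "x1": {"color": "red", "size": "big", "class": "yes"},
    "x2": {"color": "red", "size": "big", "class": "no"},
    "x3": {"color": "blue", "size": "big", "class": "no"},
    "x4": {"color": "blue", "size": "small", "class": "no"},
}

print(elementary_sets(table, ["color", "size"]))
print(elementary_sets(table, ["size"]))
```

Using fewer attributes gives a coarser partition, which is the effect the later α-indiscernibility relation exploits.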

Approximations of sets: A key idea in rough set theory is the approximation of
concepts using two operators, which assign to any subset of the universe,
$X \subseteq U$, two approximations called lower and upper approximations,
denoted respectively $I_{lower}$ and $I_{upper}$:

$$I_{lower}(X) = \{x \in U \mid I_K(x) \subseteq X\}, \qquad I_{upper}(X) = \{x \in U \mid I_K(x) \cap X \ne \emptyset\}$$

The $I_{lower}$ approximation of X is the set of elements that certainly belong
to X, whereas the $I_{upper}$ approximation of X is the set of elements that
possibly belong to X. Elements that are possibly in X but do not certainly
belong to X define a doubtful region called the boundary region, i.e.,
$Bound(X) = I_{upper}(X) - I_{lower}(X)$. We say that a set X is rough
(inexact) when its boundary is a non-empty set. In this paper, we introduce a
new method for data reduction based on conflicting data analysis. The notion of
conflicting data is based on the relationship between boundaries of concepts.
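
A minimal sketch of the two approximations and the boundary region follows. The table and the function name `approximations` are hypothetical; the equivalence classes are built by grouping objects on the chosen attributes.

```python
def approximations(table, attributes, concept):
    """Return (lower, upper, boundary) of `concept` with respect to `attributes`."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes.setdefault(key, set()).add(obj)

    lower, upper = set(), set()
    for cls in classes.values():
        if cls <= concept:   # class lies entirely inside the concept
            lower |= cls
        if cls & concept:    # class overlaps the concept
            upper |= cls
    return lower, upper, upper - lower


# Hypothetical information system: object -> {attribute: value}.
table = {
    "x1": {"color": "red", "size": "big", "class": "yes"},
    "x2": {"color": "red", "size": "big", "class": "no"},
    "x3": {"color": "blue", "size": "big", "class": "no"},
    "x4": {"color": "blue", "size": "small", "class": "no"},
}
concept_no = {obj for obj, row in table.items() if row["class"] == "no"}

lower, upper, boundary = approximations(table, ["color", "size"], concept_no)
print(sorted(lower), sorted(upper), sorted(boundary))
```

Here x1 and x2 agree on both condition attributes but differ in class, so they end up in the boundary and the concept is rough.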

2.2  Conflicting  Data  Analysis 

The notion of conflict plays an important role in different domains like
business and military operations. Different formal models of conflict have been
proposed [5] [8]. We use the notion of boundary to express conflictual
relations between concepts. The normalization of this conflict leads to the
selection of a subset of instances. Let us first introduce a binary relation
between two instances. According to a subset of attributes P, two instances x
and y are said to be allied, denoted $\chi_P(x, y)$, if they have the same
value for all attributes in P: $\chi_P(x, y) = 1$ if
$x_p = y_p\ \forall p \in P$, and 0 otherwise.

Definition 1 (Conflicting instances) Let D be the set of condition attributes,
C the decision attribute, and $Q = D \cup C$. We say that $x, y \in U$ are
conflicting if, and only if, they are redundant or inconsistent:

-  Redundant instances have the same value for both condition and decision
attributes, i.e., $\chi_Q(x, y) = 1$.

-  Inconsistent instances have the same value for condition attributes, but
different values for the decision attribute, i.e.,
$\chi_D(x, y) = 1 \wedge \chi_C(x, y) = 0$.
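
Definition 1 can be illustrated with a short check on pairs of instances. The instance data, attribute names and function names below are hypothetical, and the allied relation is written as a 0/1 indicator.

```python
def allied(x, y, attributes):
    """1 if x and y have equal values on all given attributes, else 0."""
    return int(all(x[a] == y[a] for a in attributes))


def conflict_type(x, y, condition, decision):
    """Return 'redundant', 'inconsistent', or None when there is no conflict."""
    if not allied(x, y, condition):
        return None
    return "redundant" if allied(x, y, [decision]) else "inconsistent"


# Hypothetical instances with condition attributes D and decision attribute C.
D, C = ["color", "size"], "class"
x1 = {"color": "red", "size": "big", "class": "yes"}
x2 = {"color": "red", "size": "big", "class": "no"}
x3 = {"color": "red", "size": "big", "class": "yes"}
x4 = {"color": "blue", "size": "big", "class": "yes"}

print(conflict_type(x1, x3, D, C))
print(conflict_type(x1, x2, D, C))
print(conflict_type(x1, x4, D, C))
```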


2.3  Extending  Rough  Set  Theory 

Rough set theory is formulated using a basic notion of indiscernibility between
objects of the universe, which is based on binary relations. Many different
studies have been developed to extend rough set theory by replacing the
classical equivalence relation by different kinds of binary relations. The
choice of a given indiscernibility relation directly alters the interpretation
of rough sets.

Definition 2 (α-Indiscernibility) Let K be a subset of attributes,
$\alpha \in [0, 1]$, and $I_K$ the Pawlak indiscernibility relation. Two
instances x, y of the universe U are said to be α-indiscernible, denoted
$x\, I_K^{\alpha}\, y$, if, and only if:

$$\exists K' \subseteq K \text{ such that } x\, I_{K'}\, y \text{ and } f(K, K') \ge \alpha$$

Consequently, the semantics of the indiscernibility relation is rich, as the
function f can be defined according to the prior domain knowledge. In what
follows we consider the following domain-independent function f:
$f(K, K') = \frac{|K'|}{|K|}$, where |K| denotes the cardinality of K.

The extension of the well-known definition of indiscernibility $I_K$ is
important, especially for high dimensional spaces. In fact, the relation $I_K$
tends to break down in high dimensional spaces. The main reason is that the
resulting partitioning of the universe is probably very fine when the
cardinality of the set of attributes K is very high: for any pair of objects of
the universe, there are likely only a few dimensions on which the objects are
indiscernible. Different algorithms have been proposed to deal with this
problem, especially in the context of clustering of high dimensional spaces.
Our formalization is a new way to consider this problem in the context of rough
set theory and meets the approach developed for fast projected clustering
algorithms by Aggarwal et al. [1].

The use of this parameterized indiscernibility relation leads to a weak definition of conflict. In fact, we can easily express the two main notions of conflicting data introduced before, i.e., redundancy and inconsistency, using $I_K^{\alpha}$. Two instances $x, y$ are said to be:

— Redundant iff $x\, I_Q^{1}\, y$ (i.e., $\chi_Q(x, y) = 1$).

— Inconsistent iff $x\, I_D^{1}\, y \wedge \neg(x\, I_Q^{1}\, y)$ (i.e., $\chi_D(x, y) = 1$ and $\chi_C(x, y) = 0$).

Consequently, the binary relation $I_K^{\alpha}$ allows us to express a weak notion of conflict.
Definition 3 ($\alpha$-conflicting instances) Two instances $x, y$ of the universe $U$ are said to be conflicting at the level $\alpha$ if, and only if:

$\exists K \subseteq D$ such that $\chi_K(x, y) = 1$ and $f(D, K) \geq \alpha$.

Thus, two weakly conflicting instances $x, y$ are said to be weakly redundant if $\chi_C(x, y) = 1$, and weakly inconsistent if $\chi_C(x, y) = 0$. Data reduction can be viewed as normalizing conflicts. We resolve the conflict between instances by selecting only one instance of the dominant concept in each conflicting group. The strongly conflicting instances (corresponding to the classical framework of rough sets) are obtained when $\alpha = 1$. The parameter $\alpha$ influences and controls the number of conflicting instances, i.e., when $\alpha$ decreases, the size of each (weakly) conflicting group increases. Consequently, the number of selected instances decreases. The selection process is detailed in the next section.
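The distinction between redundant and inconsistent pairs at a given level $\alpha$ can be sketched as follows. This is only an illustration under the assumption that $f(D, K) = |K| / |D|$ and that instances are stored as dictionaries; the function name `conflict_type` and the attribute name `"class"` are hypothetical.

```python
def conflict_type(x, y, D, c, alpha):
    """Classify the pair (x, y) at level alpha, or return None if no conflict."""
    # Fraction of condition attributes on which the two instances agree.
    agreeing = sum(1 for a in D if x[a] == y[a])
    if agreeing / len(D) < alpha:
        return None
    # Same decision value: (weakly) redundant; otherwise (weakly) inconsistent.
    return "redundant" if x[c] == y[c] else "inconsistent"


D = ["q1", "q2", "q3", "q4"]
x = {"q1": 1, "q2": 0, "q3": 2, "q4": 5, "class": "a"}
y = {"q1": 1, "q2": 0, "q3": 2, "q4": 7, "class": "b"}
z = {"q1": 1, "q2": 0, "q3": 2, "q4": 5, "class": "a"}

for alpha in (1.0, 0.75):
    print(alpha, conflict_type(x, y, D, "class", alpha),
          conflict_type(x, z, D, "class", alpha))
```

In this toy data the pair (x, y) is not in conflict at $\alpha = 1$ but becomes weakly inconsistent at $\alpha = 0.75$, which illustrates how lowering $\alpha$ enlarges the conflicting groups.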

3  Conflicting  Data  Normalization 

3.1  Foundations 

The goal of this section is to introduce the concept of conflicting data normalization and to describe a method supporting the normalization process. We have identified two types of conflict, i.e., redundancy and inconsistency, each of which also has a weak form. For this reason the normalization process is divided into two steps: (1) redundancy reduction and (2) inconsistency normalization. All inconsistent instances belong to a set, which is equal to the union of the boundaries of all concepts $C_i$ defined by the constraint "$C = c_i$", where $C$ is the decision attribute. The set of all inconsistent instances is called the Global Boundary and is denoted $GB = \bigcup_i Bound(C_i)$. The redundant instances, in contrast, belong to the complement of the set $GB$, i.e., $U - GB$, which is equal to the union of the lower approximations of all concepts $C_i$. The normalization process reduces the redundancy and normalizes the inconsistency as follows:

— Redundancy: consider a set of instances $T = \{x_1, x_2, \ldots, x_n\}$ that are redundant, which means that $\chi_D(x_i, x_j) = 1$ and $\chi_C(x_i, x_j) = 1$ for all $i, j$ in $\{1, 2, \ldots, n\}$. These instances are identical, so we keep only one of them and delete all the others.

— Inconsistency: as we have seen before, $GB$ contains all inconsistent instances. Let $T \in GB/I_D$ and let $\theta$ be an operator such that:

$\theta(T) = \{C_i \in GB/I_C \mid T \cap C_i \neq \emptyset\}$

The result of the operator $\theta$ is the set of conflicting concepts given the set $T$ of conflicting instances. Only dominant concepts are kept. We define the operator $\psi$ as follows:

$\psi(T) = \{C_k : |C_k \cap T| = \max\{|C_i \cap T| : C_i \in \theta(T)\}\}$




The cardinality of a set $X$ is denoted $|X|$. The result of the operator $\psi$ is the set of dominant concepts given a set of concepts $\theta(T)$. The operator $\psi$ carries out a voting operation between conflicting concepts. Inconsistency normalization means replacing each set $T$ of $GB/I_D$ by one instance per dominant concept. If there are $m$ dominant concepts, $m$ instances from $T$ are randomly selected, each one representing a dominant concept. Thus, instances of $T$ that belong to non-dominant concepts are deleted, and a pruning operation is performed on the dominant concepts.

The previous normalization depends on the parameter $\alpha$. Considering only strictly conflicting data, i.e., $\alpha = 1$, means that all attributes are used to distinguish instances. However, taking into account all attributes in high dimensional spaces is a real obstacle for conflict analysis. In fact, the lower the cardinality of $GB$, the more the conflict is reduced. For this reason, we can vary the value of $\alpha$ from 1 to 0 to evaluate the weakness of the conflict between concepts.

In order to evaluate the cost of this approach, let us assume that there are $N$ instances. In the worst case, the number of comparisons to compute the approximations, i.e., to find all conflicts, is equal to $\frac{N(N-1)}{2}$, and the conflict normalization needs $N$ comparisons. Consequently, the total number of comparisons needed to select instances is $\frac{N(N-1)}{2} + N$. So, the complexity is $O(N^2)$.

3.2  Conflict  Normalization  by  Hand 


The simple example considered here shows how the normalization method works step by step. Let us consider the information system drawn in Table 1. The universe $U$ contains 16 instances and is equal to $\{1, 2, \ldots, 16\}$. The set of attributes $Q = \{q_1, q_2, q_3, q_4\}$ is divided into condition attributes $D = \{q_1, q_2, q_3\}$ and decision attribute $C = \{q_4\}$. The partitionings produced when we consider the condition attributes and the decision attribute are respectively $U/I_D = \{\{1, 2, 3, 4\}, \{5\}, \{6\}, \{7\}, \{8, 9, 15, 16\}, \{10\}, \{11\}, \{12, 13\}, \{14\}\}$ and $U/I_C = \{\{1, 3, 7, 11\}, \{2, 5, 9, 12, 13, 14, 16\}, \{4, 6, 8, 10, 15\}\}$. According to the latter partitioning, we obtain three concepts $C_1 = \{1, 3, 7, 11\}$, $C_2 = \{4, 6, 8, 10, 15\}$ and $C_3 = \{2, 5, 9, 12, 13, 14, 16\}$.

The approximations and boundaries of the three concepts are: $I_{lower}(C_1) = \{7, 11\}$, $I_{upper}(C_1) = \{1, 2, 3, 4, 7, 11\}$, $Bound(C_1) = \{1, 2, 3, 4\}$, $I_{lower}(C_2) = \{6, 10\}$, $I_{upper}(C_2) = \{1, 2, 3, 4, 6, 8, 9, 10, 15, 16\}$, $Bound(C_2) = \{1, 2, 3, 4, 8, 9, 15, 16\}$, $I_{lower}(C_3) = \{5, 12, 13, 14\}$, $I_{upper}(C_3) = \{1, 2, 3, 4, 5, 8, 9, 12, 13, 14, 15, 16\}$, $Bound(C_3) = \{1, 2, 3, 4, 8, 9, 15, 16\}$.
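The approximations listed above can be recomputed from the two partitionings of the example. The sketch below is an illustration rather than the authors' implementation: it hard-codes the partition induced by the condition attributes and the three concepts, and the helper names `lower`, `upper` and `boundary` are chosen for this example.

```python
# Partition U/I_D and the three concepts, copied from the worked example.
partition_D = [{1, 2, 3, 4}, {5}, {6}, {7}, {8, 9, 15, 16},
               {10}, {11}, {12, 13}, {14}]
concepts = {
    "C1": {1, 3, 7, 11},
    "C2": {4, 6, 8, 10, 15},
    "C3": {2, 5, 9, 12, 13, 14, 16},
}


def lower(concept, blocks):
    """Union of the blocks entirely contained in the concept."""
    return set().union(*(b for b in blocks if b <= concept))


def upper(concept, blocks):
    """Union of the blocks that intersect the concept."""
    return set().union(*(b for b in blocks if b & concept))


def boundary(concept, blocks):
    """Upper approximation minus lower approximation."""
    return upper(concept, blocks) - lower(concept, blocks)


# Global boundary: union of the boundaries of all concepts.
global_boundary = set().union(
    *(boundary(c, partition_D) for c in concepts.values()))

for name, members in concepts.items():
    print(name, sorted(lower(members, partition_D)),
          sorted(boundary(members, partition_D)))
print("GB", sorted(global_boundary))
```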
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Only  one  indiscernible  subset  of  instances,  i.e.,  {12, 13},  which  belongs  to  the 
lower  approximation  of  the  concept  C3  is  considered  during  the  redundancy  re¬ 
duction  phase.  It  is  replaced  by  a  randomly  selected  instance  from  {12, 13}.  The 
global boundary set, i.e., $GB$, is equal to the set $\{1, 2, 3, 4, 8, 9, 15, 16\}$. Consequently, the set of subsets of instances that are indiscernible in terms of condition attributes is $GB/I_D = \{\{1, 2, 3, 4\}, \{8, 9, 15, 16\}\}$. Let us consider a subset $T_1$ of indiscernible instances belonging to $GB$ such that $T_1 = \{1, 2, 3, 4\}$. According to the definition of the operator $\theta$ we obtain $\theta(T_1) = \{C_1, C_2, C_3\}$. The voting procedure produces the dominant conflicting concept $\psi(T_1) = \{C_1\}$. Consequently, we replace the subset $T_1$ by a randomly selected instance from $C_1 \cap T_1 = \{1, 3\}$. Similarly, we apply the same procedure to the subset $T_2 = \{8, 9, 15, 16\}$, which will be replaced by two instances randomly selected to represent respectively $\{9, 16\}$ and $\{8, 15\}$.
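The voting step of this example can be reproduced with a short sketch of the two operators. The function names `theta`, `psi` and `normalize` and the use of a seeded random generator are assumptions of this illustration; the concepts and the two conflicting groups are taken from the example.

```python
import random

concepts = {
    "C1": {1, 3, 7, 11},
    "C2": {4, 6, 8, 10, 15},
    "C3": {2, 5, 9, 12, 13, 14, 16},
}
T1 = {1, 2, 3, 4}
T2 = {8, 9, 15, 16}


def theta(T, concepts):
    """Names of the concepts that share at least one instance with T."""
    return {name for name, members in concepts.items() if T & members}


def psi(T, concepts):
    """Dominant concepts: those with the most instances inside T."""
    candidates = theta(T, concepts)
    best = max(len(concepts[name] & T) for name in candidates)
    return {name for name in candidates if len(concepts[name] & T) == best}


def normalize(T, concepts, rng):
    """Keep one randomly chosen instance of T per dominant concept."""
    return {name: rng.choice(sorted(concepts[name] & T))
            for name in sorted(psi(T, concepts))}


rng = random.Random(0)  # seeded only to make the illustration repeatable
print(sorted(theta(T1, concepts)), sorted(psi(T1, concepts)))
print(normalize(T1, concepts, rng))
print(normalize(T2, concepts, rng))
```

For $T_2$ the vote is tied between $C_2$ and $C_3$ (two instances each), so one representative is kept for each of them.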

4  Optimization  for  Large  Datasets 

We cannot directly apply the conflict normalization method described before to large datasets because its time complexity is quadratic. To deal with this problem we have reduced the number of comparisons necessary to compute approximations and to determine conflicting data. To achieve this goal we propose an incremental clustering algorithm. Given an indiscernibility threshold $\alpha$ and a maximal number of clusters, this algorithm proceeds in two steps: (1) Building clusters: the content of each cluster is summarized by a vector of values, denoted $D^*$. Each entry $D^*_q$ of this vector represents the most frequent value of the attribute $q$ ($q \in 1 \ldots |D|$) in the current cluster. This step is achieved by incrementally comparing each object with the representative vector of each cluster built so far. If they are $\alpha$-indiscernible then the current object is inserted in the cluster and the vector $D^*$ is updated; otherwise, a new cluster is created. (2) After building the clusters, we apply the normalization process to each cluster, which is achieved by choosing, for the dominant concepts, the object nearest to $D^*$, i.e., by applying the $\psi$ operator to each cluster.

CCDN-Algorithm: Our algorithm, called CCDN-Algorithm (Clustering based Conflicting Data Normalization Algorithm), can be summarized as follows:


Input N : Training set size; M : Maximal number of clusters;
      D : Predictive attributes; C : Class; α : indiscernibility threshold;
Output ReducedInstancesSet;
for i:=1 to N Insert(i, Cluster, α);
    /* Insert the i-th object in a cluster among already created clusters. */
/* Let ClusterNumber be the number of created clusters (ClusterNumber ≤ M) */
for j:=1 to ClusterNumber Add(ReducedInstancesSet, BestObjects(Cluster[j]));
    /* Select the best object(s) from each cluster */
Return (ReducedInstancesSet);
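The two steps of the pseudocode can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation, and it fills in details the text leaves open with explicit assumptions: instances are tuples of discrete condition values paired with a class label, similarity to a cluster is the fraction of attributes equal to its representative vector, an object that matches no cluster once the cluster limit is reached is placed in the closest cluster, and ties are broken by first occurrence. The names `Cluster`, `agreement` and `ccdn` are hypothetical.

```python
from collections import Counter


def agreement(obj, rep):
    """Fraction of attributes on which obj equals the representative vector."""
    return sum(a == b for a, b in zip(obj, rep)) / len(rep)


class Cluster:
    def __init__(self, n_attrs):
        self.members = []  # list of (conditions, label) pairs
        self.counts = [Counter() for _ in range(n_attrs)]

    def add(self, cond, label):
        self.members.append((cond, label))
        for counter, value in zip(self.counts, cond):
            counter[value] += 1

    def representative(self):
        """Vector D*: most frequent value of each attribute in the cluster."""
        return tuple(c.most_common(1)[0][0] for c in self.counts)


def ccdn(data, alpha, max_clusters):
    clusters = []
    # Step 1: incremental clustering against the representative vectors.
    for cond, label in data:
        target = None
        for cluster in clusters:
            if agreement(cond, cluster.representative()) >= alpha:
                target = cluster
                break
        if target is None:
            if len(clusters) < max_clusters:
                target = Cluster(len(cond))
                clusters.append(target)
            else:
                # Assumption: fall back to the closest existing cluster.
                target = max(clusters, key=lambda cl: agreement(
                    cond, cl.representative()))
        target.add(cond, label)
    # Step 2: per cluster, keep the member nearest to D* for each dominant class.
    reduced = []
    for cluster in clusters:
        votes = Counter(label for _, label in cluster.members)
        top = max(votes.values())
        rep = cluster.representative()
        for label in sorted(l for l, v in votes.items() if v == top):
            candidates = [m for m in cluster.members if m[1] == label]
            reduced.append(max(candidates, key=lambda m: agreement(m[0], rep)))
    return reduced


data = [((1, 0, 1), "a"), ((1, 0, 1), "a"), ((1, 0, 0), "b"),
        ((0, 1, 0), "b"), ((0, 1, 0), "b")]
print(ccdn(data, alpha=0.6, max_clusters=10))
```

Each object is compared with at most `max_clusters` representatives, which is where the $O(NM)$ bound discussed in the text comes from.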
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In  order  to  evaluate  the  cost  of  this  algorithm  let  us  assume  that  there  are  N 
objects and the number of allowable clusters is $M$ ($M \ll N$). For the insert function, in the worst case, the number of comparisons to insert an object in a cluster among $M$ is equal to $M$. So, to process all objects, the maximal number of comparisons is $NM$. For the second function, in order to select the best objects of each cluster, we need $N$ comparisons. Consequently, the complexity of the CCDN algorithm is $O(NM)$.

5  Experimental  Results 

In order to evaluate the proposed instance selection method we ran experiments on 12 real-world datasets taken from the UCI repository [2]. Their characteristics are summarized in Table 2.

Table 2. Datasets considered: the size of the training set and test set, the number of attributes, class cardinality and the percentage of numeric attributes.


Base        Train   Test    #att  #class  %Num
Australian  552     138     14    2       43
Pima        615     153     8     2       [?]
Vehicle     677     169     18    4       [?]
Segment     1848    462     [?]   7       95
Abalone     3133    [?]     8     3       88
Annthyroid  3772    3428    21    3       29
Mushroom    6499    1625    [?]   2       0
Pendegit    7494    3498    [?]   [?]     [?]
Letter      [?]     [?]     16    26      [?]
Adult       32561   16281   14    2       43
Shuttle     43500   14500   8     7       [?]
Covertype   387342  193670  54    7       18

The original data are transformed using the discretization method proposed by Van de Merckt in [13]. We have used the C4.5 system [10] to construct a decision tree from both the original data and the selected instances. Each original dataset that does not contain a specific test set is randomly separated into a training set (80%) and a test set (20%); for the Covertype dataset, we use 66% for training and 33% for testing. First, our method is used to select a subset of instances. Its results are then compared with random and stratified sampling methods. For these latter methods, we repeat sampling and data mining 10 times and present average results. The size of the considered samples is determined by our method.

In order to evaluate the cost of our proposed algorithm, we plot the time evolution of conflicting data identification for the largest dataset, i.e., the Covertype dataset. We have shown that the complexity of the CCDN algorithm is linear (Section 4). Fig. 1 illustrates this linearity on the Covertype dataset when considering 48 attributes, i.e., $\alpha = 48/54 = 0.89$.

The results in Table 3 show that the size of the returned sample is lower than 25% of the original dataset for 11 datasets out of 12, and lower than 10% for 7 datasets.

The instances selected using the method based on conflicting data normalization lead to a model that is more accurate than the ones extracted from a sample obtained by random or stratified sampling; the differences are 5.36% and 4.28% respectively. However, this is not the only contribution of our work. In fact, the size of the sample is generally given by the user before the selection of instances. The problem is, which size to choose? Our method does not need this information; it is part of its result. Besides, we have introduced a formal notion based on conflicting data, which can play an important role in data understanding.

Fig. 1. Time evolution for Covertype dataset

Table  3.  The  accuracy  of  C4.5,  the  percentage  of  selected  instances,  the  accuracy 
of  C4.5  using  selected  instances  and  C4.5  accuracy  (with  the  same  percentage)  using 
random  and  stratified  sampling. 


Dataset     C4.5   %Size  CCDN-C4.5  Rand-C4.5    Strat-C4.5
Australian  85.5   7.25   85.5       80.53±3.0    82.19±1.5
Pima        76.5   10.57  77.1       71.62±2.1    72.35±1.6
Vehicle     71.0   12.85  72.2       64.15±2.8    63.53±2.3
Segment     92.4   16.72  92.4       83.31±10.7   87.37±1.1
Abalone     63.6   [?]    63.4       58.21±2.0    58.51±1.4
Annthyroid  94.0   2.82   93.2       93.57±0.4    92.94±0.2
Mushroom    100    0.46   [?]        91.78±4.20   89.85±5.0
Pendegit    91.7   23.87  89.9       84.77±7.3    87.44±0.5
Letter      79.4   46.60  76.7       67.91±8.0    74.25±0.3
Adult       85.4   2.50   84.1       83.76±1.2    83.37±0.7
Shuttle     99.8   0.16   99.7       96.65±2.1    95.33±2.3
Covertype   62.2   1.66   62.5       54.94±1.21   57.03±0.9
MEAN        83.46  [?]    82.98      77.60        78.70

The size of the sample produced by the CCDN-Algorithm can be very low without decreasing the quality of the knowledge induced from the selected instances. For instance, only 68 instances (0.16%) are selected from the Shuttle dataset, which contains 43500 instances. The quality of classification is the same for both the selected instances and all data. Also, only 0.46% of the instances are selected from the original Mushroom data, and the accuracy decreases by only 1%. The most important result is obtained with the largest dataset, i.e., the Covertype dataset. Among 387342 instances, our algorithm chooses only 1.66% (6445 instances), and the accuracy of the reduced model is slightly better than that of the original model.
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6  Conclusion 

This paper tackles the problem of efficiently mining very large data sets and proposes two solutions based on instance selection via conflict normalization. The proposed method is developed in a conceptual framework, which is an extension of Rough Set Theory called $\alpha$-Rough Sets. The contribution of this paper is fourfold: (1) formalization of a notion of conflicting concept, (2) proposition of a method for conflict normalization to select a subset of instances, (3) proposition of a heuristic algorithm to avoid the quadratic complexity, and (4) presentation of results of extensive experiments on different datasets.
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Abstract.  This  paper  focuses  on  a  performance  comparison  of  two  rule  matching 
(classification)  methods,  used  in  data  mining  systems  AQ15  and  LERS.  All  rule 
sets  used  in  our  experiments  were  induced  by  the  LERS  (Learning  from  Examples 
using  Rough  Sets)  system  from  ten  typical  input  data  sets.  Then  these  rule  sets 
were  truncated  using  three  different  criteria:  t-weight,  u-weight  and  the  strongest 
rule. The truncation process was performed using six different cut-off values for t-weight, six different cut-off values for u-weight and using the strongest rule option.
Hence  for  each  of  the  input  rule  files  thirteen  truncated  rule  sets  were  created. 
Performance  was  measured  by  a  classification  error  rate.  The  objective  of  this 
study  was  to  determine  the  best  overall  method  of  classification  and  the  best 
truncation  option. 

Keywords.  Knowledge  discovery  and  data  mining,  rule  induction,  classification 
systems,  rule  truncation,  AQ15,  LERS,  ten-fold  cross  validation,  Wilcoxon 
matched-pairs  signed  rank  test. 


1  Introduction 

In  this  paper  input  data  were  presented  in  the  form  of  a  table,  called  a  decision  table. 
The  columns  are  labeled  by  variables.  One  of  the  variables  is  called  a  decision  and 
the  remaining  variables  are  called  attributes.  The  rows  represent  examples.  A 
concept  is  defined  as  a  set  of  all  examples  having  the  same  decision  value.  All 
examples  belonging  to  the  concept  C  are  called  positive  examples  for  C  and  all 
remaining  examples  are  called  negative  examples  for  C.  The  concepts  are  described 
in the form of rule sets [3, 9].

Like  many  other  data  mining  systems,  AQ15  and  LERS  were  primarily  designed  to 
induce  rules  from  training  examples.  Both  systems  are  equipped  with  modules  for 
rule  matching  used  for  classification  of  new,  unseen  examples. 

In  our  research  all  rule  sets  used  for  experiments  were  induced  by  the  LEM2 
algorithm  [4]  of  LERS,  since  the  rule  matching  module  of  LERS  would  not  recognize 
rules  in  the  format  of  AQ15  and  its  successors,  AQ17  and  AQ18.  Thus  we  compare 
only  rule  matching  (classification)  performance  of  AQ15  and  LERS,  restricted  to  the 
rule  sets  induced  by  LERS. 

The rule set, induced by the inductive learning process, may be used for interpreting regularities hidden in the input data, for visualization of these regularities, or for classification of unseen examples [5, 10, 14]. During the classification process of unseen data, an attempt is made by all rules to match each example. For each example, the following are the different possible outcomes of the classification process:

•  The  example  is  exclusively  and  correctly  classified  as  a  member  of  the  correct 
concept, 

•  The  example  is  exclusively  and  incorrectly  classified  as  a  member  of  a  wrong 
concept, 

•  The  example  is  correctly  classified  and  incorrectly  classified  at  the  same  time, 
i.e.,  some  rules  classify  it  as  a  member  of  the  correct  concept,  while  other  rules 
classify  it  as  a  member  of  a  wrong  concept, 

•  The  example  is  not  classified  by  any  rule. 

An example is completely classified by a rule if all the attribute-value pairs of the rule match the attribute values of the example. The example is partially classified if only some of the attribute-value pairs of the rule match the attribute values of the example. The example is not classified at all if none of the attribute values of the example match any of the attribute-value pairs of any rule.

Truncation  [10]  is  a  method  of  reducing  the  rule  set  by  deleting  weak  rules, 
describing  a  few  training  examples.  Concepts  can  match  different  examples  with 
varying  degrees  of  precision  and  have  context-dependent  meaning.  Instead  of  seeking 
a  strict  match,  the  system  determines  the  degree  of  similarity  between  the  concept 
description  and  the  given  example,  and  compares  it  with  the  results  from  matching  the 
example  with  other  concept  descriptions.  The  concept  that  gives  the  best  match  is 
assigned  to  the  example.  Michalski's  truncation  algorithm  was  originally  designed  for 
AQ15,  but  the  algorithm  works  for  rule  sets  generated  by  LERS  as  well. 

This  paper  presents  a  performance  comparison  of  AQ15  and  LERS  classification 
systems  with  rule  set  truncation.  In  our  experiments  we  used  the  following 
assumptions: 

•  All  rule  sets  were  induced  by  the  LERS  system, 

•  Michalski’s  truncation  algorithm  was  used  to  prune  the  rule  sets, 

•  10-fold cross validation process [15] was used to validate results,

•  AQ15  and  LERS  were  used  to  classify  the  unseen  data  with  the  truncated  rule 
sets, 

•  Wilcoxon  matched-pairs  signed  rank  test  [7]  was  used  to  compare  the  AQ15  and 
LERS  classification  systems. 

2  Michalski’s  Rule  Truncation  Algorithm 

In AQ15 [10], each rule is associated with a pair of weights: t and u, representing the total number of training examples correctly classified by the rule, and the number of training examples uniquely and correctly classified by the rule, respectively. The t-weight may be interpreted as the strength of a rule, an idea also used in the LERS classification method, while the u-weight is a measure of how much the rules differ from each other. The rule with the highest t-weight may be interpreted as describing the most typical examples of the concept, while rules with the lowest u-weights can be viewed as describing exceptional examples [10].
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In  AQ15  there  are  two  methods  of  recognizing  the  concept  membership  of  an 
example:  the  strict  match  and  the  flexible  match.  In  the  strict  match,  an  example  must 
satisfy  all  conditions  of  the  rule.  In  the  flexible  match,  a  degree  of  similarity  between 
the  example  and  the  rule  is  determined. 

During the truncation process [10], we remove weak rules, i.e., those whose t-weight or u-weight does not exceed some cut-off, and apply flexible matching to classify an example. By removing the weak rules, the total number of rules describing the concept is reduced. As a result, the remaining rules may no longer match some examples completely, as they would have before the truncation process. Thus, a truncated rule set is simpler but it requires a more sophisticated classification method: flexible matching is used to classify the example. By applying a flexible match the example may still be very closely related to the correct concept and thus may be correctly recognized. An interesting problem is to test how the rule set truncation method affects the accuracy of classification. Results of similar research were reported in [1].
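The truncation criteria can be illustrated with a small sketch. The rule representation (a dictionary holding a concept name and the two weights) and the function names are assumptions made for this example; the sketch only mirrors the textual description, namely that rules whose chosen weight does not exceed the cut-off are removed, and that the strongest rule option keeps, for each concept, the rule with the largest t-weight.

```python
def truncate(rules, weight, cutoff):
    """Keep only the rules whose given weight ('t' or 'u') exceeds the cut-off."""
    return [r for r in rules if r[weight] > cutoff]


def strongest_rule(rules):
    """Keep, for each concept, the single rule with the largest t-weight."""
    best = {}
    for r in rules:
        current = best.get(r["concept"])
        if current is None or r["t"] > current["t"]:
            best[r["concept"]] = r
    return list(best.values())


# Hypothetical rule set: only the fields needed for truncation are shown.
rules = [
    {"concept": "yes", "t": 12, "u": 7},
    {"concept": "yes", "t": 3, "u": 1},
    {"concept": "no", "t": 9, "u": 9},
    {"concept": "no", "t": 2, "u": 2},
]

print(len(truncate(rules, "t", 2)))  # rules with t-weight above 2
print(len(truncate(rules, "u", 1)))  # rules with u-weight above 1
print(strongest_rule(rules))
```

Note that a high cut-off may remove every rule of a concept, which is one reason flexible matching is needed after truncation.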

3  AQ15  Classification  System 

The original classification module of AQ15 was called ATEST [13]. AQ15 provided three methods of rule testing. For this paper we re-implemented and tested one of them, the method described in [10]. In this method, during the process of recognizing the example against a set of rules, there are three possible outcomes [10]:

•  Only one rule classifies the example (SINGLE_MATCH case),

•  More than one rule classifies the example (MULTIPLE_MATCH case),

•  No rule recognizes the example (NO_MATCH case).

When  recognizing  the  examples,  each  of  the  above  categories  requires  a  different 
evaluation  procedure. 

In the SINGLE_MATCH case the classification is straightforward. If the rule decision is equal to the known decision for the example, the example is counted as correctly classified. If not, it is considered wrongly classified. In the MULTIPLE_MATCH and NO_MATCH cases, the classification procedures are more complicated.

MULTIPLE_MATCH case: In this case more than one rule classifies the example. The system selects the most probable decision. Let us consider n concepts, $C_1, C_2, \ldots, C_n$, that classify the example e. Each concept $C_i$ is described by a rule set. In AQ15, a rule is called a complex, and it is said that the rule set is a disjunction of complexes (Cpx); each complex (rule) in turn is a conjunction of selectors (Sel). The estimate of probability, EP, of a concept $C_i$ is defined as the probabilistic sum of the EPs of its complexes. If the rule set for $C_i$ consists of a disjunction of two complexes $Cpx_1$ and $Cpx_2$, then the corresponding estimate of probability is computed in the following way:

$EP(C_i, e) = EP(Cpx_1, e) + EP(Cpx_2, e) - EP(Cpx_1, e) \cdot EP(Cpx_2, e)$,

where EP of a complex $Cpx_j$ in the context of the example e is the ratio of the total number of positive examples classified by the complex $Cpx_j$ (i.e., the weight of $Cpx_j$) to the total number of training examples, if the complex recognizes the example e, and it is equal to 0 otherwise:

$EP(Cpx_j, e) = \begin{cases} \frac{Weight(Cpx_j)}{\#examples} & \text{if complex } Cpx_j \text{ recognizes example } e, \\ 0 & \text{otherwise.} \end{cases}$

The most probable concept is the one with the largest EP.
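The estimate of probability can be sketched as follows. The data layout (a complex as a dictionary of selectors mapping an attribute to its allowed values, plus a weight) is an assumption of this illustration, as is applying the two-complex formula pairwise when a concept has more than two complexes.

```python
from functools import reduce


def recognizes(cpx, e):
    """A complex recognizes e if every selector is satisfied by e."""
    return all(e[attr] in values for attr, values in cpx["selectors"].items())


def ep_complex(cpx, e, n_examples):
    """EP(Cpx, e) = Weight(Cpx) / #examples if Cpx recognizes e, else 0."""
    return cpx["weight"] / n_examples if recognizes(cpx, e) else 0.0


def prob_sum(p, q):
    """Probabilistic sum used for a disjunction of two complexes."""
    return p + q - p * q


def ep_concept(complexes, e, n_examples):
    """Fold the probabilistic sum over all complexes of a concept."""
    return reduce(prob_sum,
                  (ep_complex(c, e, n_examples) for c in complexes), 0.0)


# Hypothetical rule sets for two concepts and 100 training examples.
concepts = {
    "A": [{"selectors": {"color": {"red", "blue"}}, "weight": 30},
          {"selectors": {"size": {"big"}}, "weight": 20}],
    "B": [{"selectors": {"color": {"red"}, "size": {"small"}}, "weight": 25},
          {"selectors": {"size": {"big"}}, "weight": 25}],
}
e = {"color": "red", "size": "big"}

scores = {name: ep_concept(cpxs, e, 100) for name, cpxs in concepts.items()}
best = max(scores, key=scores.get)
print(scores, best)
```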

NO_MATCH case: In this case there are no complexes that classify the example e. The system uses flexible matching to determine the best complex that suggests the most probable decision. One way to perform such flexible matching is to measure the fit between attribute values of the example and the concepts. A measure of fit (MF) is defined as follows: MF of a concept $C_i$ to an example e is computed as a probabilistic sum for a disjunction of all complexes. If the concept $C_i$ consists of a disjunction of two complexes $Cpx_1$ and $Cpx_2$, the measure of fit for $C_i$ is defined as [10]:

$MF(C_i, e) = MF(Cpx_1, e) + MF(Cpx_2, e) - MF(Cpx_1, e) \cdot MF(Cpx_2, e)$,

where MF of a complex $Cpx_j$ to an example e is defined as the product of the MFs of all selectors of $Cpx_j$, weighted by the proportion of training examples covered by $Cpx_j$:

$MF(Cpx_j, e) = \prod_k MF(Sel_k, e) \cdot \frac{Weight(Cpx_j)}{\#examples}$,

where $Weight(Cpx_j)$ is the number of training examples covered by $Cpx_j$. MF of a selector $Sel_k$ and an example e is 1 if the selector is satisfied by the example, i.e., if the example's value of the attribute is equal to one of the selector values. If no selector value is equal to the attribute value of the example, its MF is proportional to the amount of the decision space covered by the selector, i.e., it is the ratio of the number of attribute values in the selector to the total number of all possible values of the attribute:

$MF(Sel_k, e) = \begin{cases} 1 & \text{if selector } Sel_k \text{ is satisfied by } e, \\ \frac{\#Values}{Domain\ size} & \text{otherwise.} \end{cases}$

Note that the measure of fit, MF, is a generalization of the estimate of probability EP. When all selectors in a complex are satisfied, the measure of fit is equal to the estimate of probability [10]. There exists another possibility of defining $MF(Cpx_j)$. In AQ18 this feature is extended to selectors [8].
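A minimal sketch of the measure of fit follows, under the same kind of assumptions as any illustration of this method: a complex is represented as a dictionary of selectors (attribute mapped to a set of allowed values) with a weight, the domain size of each attribute is supplied separately, and the two-complex probabilistic sum is applied pairwise for longer disjunctions. All names are hypothetical.

```python
def mf_selector(values, domain_size, e_value):
    """1 if the selector is satisfied, else #Values / Domain size."""
    return 1.0 if e_value in values else len(values) / domain_size


def mf_complex(cpx, e, domains, n_examples):
    """Product of selector MFs, weighted by Weight(Cpx) / #examples."""
    product = 1.0
    for attr, values in cpx["selectors"].items():
        product *= mf_selector(values, domains[attr], e[attr])
    return product * cpx["weight"] / n_examples


def mf_concept(complexes, e, domains, n_examples):
    """Probabilistic sum of the MFs of all complexes of a concept."""
    total = 0.0
    for cpx in complexes:
        m = mf_complex(cpx, e, domains, n_examples)
        total = total + m - total * m
    return total


# Hypothetical attribute domains and complexes; 100 training examples.
domains = {"color": 4, "size": 2}
cpx1 = {"selectors": {"color": {"red", "blue"}, "size": {"big"}}, "weight": 40}
cpx2 = {"selectors": {"size": {"small"}}, "weight": 10}
e = {"color": "green", "size": "big"}

print(mf_complex(cpx1, e, domains, 100))          # color selector not satisfied
print(mf_concept([cpx1, cpx2], e, domains, 100))
```

When every selector of a complex is satisfied, `mf_complex` returns the weight divided by the number of examples, which matches the estimate of probability, as noted in the text.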
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4  LERS  Classification  System 

Rule sets were induced using the LEM2 algorithm of the LERS system [4]. In LERS, inconsistencies in training data are handled using rough set theory [11, 12]. Data are consistent if for any two different examples with all attribute values the same, the decision values are also the same.

When the data are completely classified, LERS uses the following factors: strength, specificity and support to classify an example [5]. The original approach was introduced in [2, 6]. When data are partially classified, LERS uses an additional factor called a matching factor to determine the best concept to classify the example. Strength is a measure of how well the rule has performed during classification of training data. It is computed as the number of examples correctly classified by the rule. Obviously, rules that correctly classified more examples are stronger. Specificity is a measure of complexity of the rule. Rules with larger numbers of attribute-value pairs are more specific. Matching factor is a measure of how well the rule matched the attribute values of an example. It is computed as the ratio of the number of attribute-value pairs of the rule that match the example to the total number of attribute-value pairs of the rule.

Support is computed as the sum of scores for all matching rules from one concept. It is defined as follows:

$\sum_{\text{partially matching rules } R \text{ describing } C} Strength(R) \cdot Specificity(R) \cdot Matching\_factor(R)$

The concept C for which support is the largest is the winner, and the example is classified as being a member of C.
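The support computation can be sketched as follows. The rule layout (a dictionary of attribute-value conditions, a concept and a strength) is an assumption of this example, and so is the control flow: the sketch first tries rules that match completely (matching factor 1) and falls back to partially matching rules only when none match completely.

```python
from collections import defaultdict


def matching_factor(rule, e):
    """Matched attribute-value pairs divided by all pairs of the rule."""
    matched = sum(1 for a, v in rule["conditions"].items() if e.get(a) == v)
    return matched / len(rule["conditions"])


def classify(rules, e):
    """Return the concept with the largest support, or None if nothing matches."""
    complete = [r for r in rules if matching_factor(r, e) == 1.0]
    # Assumption: partial matching is used only when no rule matches completely.
    pool = complete or [r for r in rules if matching_factor(r, e) > 0]
    support = defaultdict(float)
    for r in pool:
        specificity = len(r["conditions"])
        support[r["concept"]] += (r["strength"] * specificity
                                  * matching_factor(r, e))
    return max(support, key=support.get) if support else None


# Hypothetical rules.
rules = [
    {"concept": "yes", "conditions": {"a": 1, "b": 2}, "strength": 5},
    {"concept": "no", "conditions": {"a": 1}, "strength": 8},
    {"concept": "yes", "conditions": {"c": 3}, "strength": 4},
]

print(classify(rules, {"a": 1, "b": 2, "c": 0}))  # two complete matches
print(classify(rules, {"a": 0, "b": 2, "c": 0}))  # only a partial match
print(classify(rules, {"a": 9, "b": 9, "c": 9}))  # nothing matches
```

In the first call the supports are 5 · 2 = 10 for "yes" and 8 · 1 = 8 for "no", so the more specific rule wins even though the other rule is stronger.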

5  Experiments 

There  were  ten  typical,  well-known  data  files  used  to  compare  the  performance  of  the 
two  classification  methods.  The  basic  facts  of  the  data  files  are  described  in  Table  1. 


Table  1 


Data file      Number of  Number of   Number of  Missing
               examples   attributes  concepts   attribute values
lymphography   148        18          4          no
breast-cancer  286        9           2          yes
iris           150        4           3          no
hepatitis      155        19          2          yes
soybean        307        35          19         yes
primary-tumor  339        17          21         yes
house          435        16          2          yes
Wisconsin      625        9           9          no
mammography    1284       12          2          no
bupa           345        6           2          no

Table 2

Method       lymphography     breast-cancer        iris
             LERS    AQ15     LERS    AQ15     LERS    AQ15
t = 1       20.53   16.89    32.87   32.87     4.67    4.67
t = 2       18.24   17.57    30.42   30.42     4.67    5.33
t = 3       19.59   18.92    28.67   29.02     4.67    6.00
t = 4       17.57   16.89    28.32   28.67     3.33    5.33
t = 5       17.57   16.22    27.97   28.32     4.00    4.67
t = 10      21.62   18.92    27.97   27.62     4.00    5.33
u = 1       18.24   16.89    32.87   32.87     4.67    4.67
u = 2       18.92   15.54    30.77   31.12     4.00    8.67
u = 3       24.32   22.30    28.67   28.32    11.33   23.33
u = 4       27.03   23.65    28.67   27.62    26.67   29.33
u = 5       26.35   26.35    29.37   27.27    28.67   31.33
u = 10      52.70   47.30      -       -      30.00   32.67
Strongest   25.00   25.00    35.92   27.88     6.00   10.00

This research is focused on the classification performance of two methods, AQ15
and LERS. For each of the ten data files, the following options were used to compare
the performance of AQ15 and LERS:

•  Truncate option = t-weight with cut-off weight = 1, 2, 3, 4, 5, and 10,

•  Truncate option = u-weight with cut-off weight = 1, 2, 3, 4, 5, and 10,

•  Truncate option = strongest rule (each concept is described by the rule with
largest t-weight).


Table 3

Method         hepatitis         soybean        primary-tumor
             LERS    AQ15     LERS    AQ15     LERS    AQ15
t = 1       19.35   18.71    18.57   18.89    64.60   64.60
t = 2       19.35   18.71    18.24   18.89    64.01   64.31
t = 3       19.35   18.71    18.57   19.22    65.19   65.19
t = 4       20.65   19.35    19.54   20.52    65.78   65.72
t = 5       18.71   19.35    22.15   21.50    68.14   68.44
t = 10      20.00   20.65    49.19   50.49    70.50   71.09
u = 1       20.65   18.71    18.57   18.89    64.60   64.60
u = 2       22.58   20.00    19.87   18.89    64.01   64.90
u = 3       23.87   19.35    23.13   21.82    68.73   68.14
u = 4       23.87   20.65    25.08   24.76    69.03   69.32
u = 5       31.61   20.65    24.76   25.08    69.91   70.21
u = 10        -       -      53.09   58.31    82.89   85.25
Strongest   44.92   20.86    24.10   29.97    66.96   73.4

For each of the data files and the thirteen truncation options, the average error rates
for the AQ15 and LERS classification systems were compared using the Wilcoxon
matched-pairs signed rank test. Results of our experiments are presented in Tables 2-4.

In Tables 2-4, "-" indicates that truncation resulted in elimination of all rules (rules
were too short, i.e., not specific enough).
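
As a rough illustration of the statistic behind the Wilcoxon matched-pairs signed rank test, the sketch below computes the positive and negative rank sums for two paired lists of error rates. It is not the authors' procedure: the function name, the rounding of differences to two decimals (to keep ties stable for values such as those in the tables), and the use of average ranks for ties are assumptions. Comparing the smaller rank sum against a critical value for the chosen significance level is left out.

```python
def wilcoxon_rank_sums(xs, ys, ndigits=2):
    """Return (W_plus, W_minus, n) for paired samples xs and ys."""
    # Paired differences; zero differences are discarded.
    diffs = [round(x - y, ndigits) for x, y in zip(xs, ys)]
    diffs = [d for d in diffs if d != 0]

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)
```

Passing one method's column of error rates as `xs` and the other's as `ys` yields the two rank sums; a large imbalance between them suggests that one method tends to have the higher error rate.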


6  Conclusions 

The  following  conclusions  can  be  derived  from  our  experiments,  using  the  Wilcoxon 
matched-pairs  signed  rank  two-tailed  test  with  a  5%  significance  level.  Overall,  for 
every  individual  input  data  file  and  for  all  thirteen  results  of  different  truncation 
options,  no  method  performed  significantly  better  than  any  other. 

Using the same Wilcoxon matched-pairs signed rank two-tailed test with a 5%
significance level, one of the classification methods performed better than the
other for five input data files; for the remaining five input data files a significant
difference in performance did not occur. In the five cases where one of the
classification methods performed better than the other, AQ15 performed better for
two input data files (lymphography and iris), and LERS performed better for the
remaining three input data files (iris, primary-tumor and mammography). Thus
these two classification systems do not differ significantly.

Similarly,  we  tested  if  any  of  the  classification  methods  were  better  in  conjunction 
with  any  of  the  thirteen  specific  truncation  methods  used  in  our  experiments.  None  of 
these  thirteen  truncation  methods  resulted  in  any  significant  difference  in  performance 
of  the  two  classification  systems. 

However, for a specific data set we may observe a difference in performance
between the two classification methods. Also, for a specific rule set some truncation
may result in better performance. Therefore, we may conclude that for any specific
data set a classification system and a truncation method should be selected
individually. Hence, there is no best universal approach to classification of unseen
cases and truncation of rule sets.

Table 4

Method          house          Wisconsin       mammography         bupa
             LERS    AQ15     LERS    AQ15     LERS    AQ15     LERS    AQ15
t = 1        6.44    6.44    22.24   22.56    31.54   32.71    36.81   37.68
t = 2        6.44    6.44    19.84   20.00    31.78   32.40    37.39   38.26
t = 3        6.44    6.44    18.88   18.88    31.78   32.17    39.13   37.68
t = 4        6.44    6.44    17.92   17.92    32.63   32.71    40.58   39.71
t = 5        6.21    6.44    17.92   17.92    33.96   33.80    42.61   40.58
t = 10       6.67    6.44    22.24   17.92    38.47   38.63    42.03   42.03
u = 1        6.44    6.44    22.24   22.56    31.54   32.71    36.81   37.68
u = 2        5.75    5.75    19.84   20.00    31.85   33.02    37.1    36.81
u = 3        5.29    7.13    18.56   18.40    29.91   31.85    38.55   37.39
u = 4        5.75    8.97    19.20   17.92    31.00   32.87    40.58   42.32
u = 5        7.36    6.90    21.92   17.92    30.67   33.96    44.35   42.32
u = 10      12.87   18.62    27.20   17.92    34.35   39.80      -       -
Strongest    7.36   14.94    31.36   19.84    34.81   34.03    41.16   39.71

Abstract. A good deal of progress has been made in the past few years
in the design and implementation of control programs for autonomous
agents. A natural extension of this work is to consider solving difficult
tasks with teams of cooperating agents. Our interest in this area is motivated
in part by our involvement in a Navy-sponsored micro air vehicle
(MAV)  project  in  which  the  goal  is  to  solve  difficult  surveillance  tasks 
using  a  large  team  of  small  inexpensive  autonomous  air  vehicles  rather 
than  a  few  expensive  piloted  vehicles.  Our  approach  to  developing  control 
programs  for  these  MAVs  is  to  use  evolutionary  computation  techniques 
to  evolve  behavioral  rule  sets.  In  this  paper  we  describe  our  architecture 
for  achieving  this,  and  we  present  some  of  our  initial  results. 


1  Introduction 

One  of  the  most  challenging  aspects  of  building  intelligent  systems  is  the  design 
and implementation of control programs for intelligent autonomous agents. Manually
designing and implementing control programs that are sufficiently robust
to  handle  dynamically  changing  environments  and  uncertainty  has  proved  to  be 
extremely  difficult.  As  a  consequence,  there  has  been  considerable  interest  in  the 
use  of  machine  learning  techniques  to  help  automate  this  process. 

A  good  deal  of  progress  has  been  made  in  this  area  in  the  past  few  years 
using  a  variety  of  representations  (rules,  neural  nets,  fuzzy  logic,  etc.)  and  a 
variety  of  learning  techniques  (symbolic,  reinforcement,  evolutionary,  etc.).  A 
natural  extension  of  this  work  is  to  consider  solving  difficult  tasks  with  teams  of 
cooperating  agents. 

Our  interest  in  this  area  is  motivated  in  part  by  our  involvement  in  a  Navy- 
sponsored  micro  air  vehicle  (MAV)  project  in  which  the  goal  is  to  solve  difficult 
surveillance  tasks  using  a  large  team  of  small  inexpensive  autonomous  air  vehicles 
rather  than  a  few  expensive  piloted  vehicles.  Our  approach  to  developing  control 
programs  for  these  MAVs  is  to  leverage  off  the  successes  in  using  evolutionary 
computation  techniques  to  evolve  behavioral  rule  sets  for  single-agent  systems. 
In  this  paper  we  summarize  related  work,  we  describe  our  architecture,  and  we 
present  some  of  our  initial  results.  We  conclude  with  a  discussion  of  future  work. 

Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI  1932,  pp.  157-165,  2000. 

2  Background

Our approach to developing teams of cooperating agents is to represent agent
behaviors as sets of rules and evolve these rule sets using evolutionary computation
techniques. There has been a good deal of work done in this area for single
agents,  but  not  cooperating  teams  of  agents.  At  the  same  time,  there  has  been 
work  done  on  “collective  robotics”  using  other  techniques.  In  this  section  we 
summarize  relevant  work  in  these  two  areas. 

2.1  Rule  Learning  Using  Evolutionary  Algorithms 

One  of  the  earliest  rule  evolving  approaches  is  Holland’s  classifier  system  [4].  In 
this  system  a  population  of  rules  is  maintained.  These  rules  both  compete  for 
space  and  priority,  while  also  cooperating  to  produce  an  appropriate  classification 
for  the  given  input. 

An  alternative  approach  is  to  maintain  a  population  of  rule-sets  which  can 
vary  in  length.  Examples  of  these  are  Smith’s  LS-1  system  [9],  and  the  GABIL 
system  which  uses  a  GA  for  concept  learning  [5].  Typically  these  types  of  systems 
build  rules  which  have  more  of  a  stimulus-response  quality. 

The SAMUEL system [3, 8], arguably one of the more successful rule evolving
systems,  uses  an  interesting  hybrid  of  these  two  approaches.  Individuals  are 
implemented  as  rule-sets,  but  SAMUEL  also  uses  a  rule  bidding  system  and 
credit  assignment  mechanism  similar  to  those  found  in  a  classifier  system. 

Wu,  Schultz  and  Agah  implemented  a  rule  learning  system  for  MAVs  using 
a  GA  [11].  Their  GA  implementation  was  very  much  a  canonical  GA,  with  a 
binary  representation,  and  proportional  selection.  Fitness  was  measured  using  a 
simulated  environment,  and  each  individual  defined  a  variable  length  rule-set. 

2.2  Collective  Robotics 

Collective  robotics  involves  the  use  of  robot  teams  which  cooperate  to  perform 
a  task  or  set  of  tasks  [1].  Teams  have  several  inherent  advantages  including  the 
ability  to  distribute  themselves,  do  problem  decomposition,  and  perform  parallel 
processing. 

Robot soccer is one of the most popular domains for studying collective robotics.
Tucker Balch implemented a soccer simulation to study task differentiation
and specialization [2]. The robots were trained using Q-learning, and they would
often  specialize  to  playing  either  a  defensive  or  offensive  position. 

Other  problem  domains  include  multi-robot  box  pushing  [6],  and  foraging 
[7]  tasks.  These  problems  are  often  solved  by  implementing  low  level  swarming 
behaviors  such  as  avoid  or  follow.  A  learning  algorithm  is  then  used  to  teach  the 
robots  to  select  behaviors  and  coordinate  with  other  robots. 

A common problem in all these experiments was getting the robots to cooperate,
particularly when learning algorithms were used. In each case the researchers
found that evaluating individuals solely on their own performance wasn’t
enough. Only when the team was evaluated as a whole did cooperation occur.

3  Our EA Architecture

Our ultimate goal is to evolve heterogeneous teams of specialized agents that
collectively  perform  specific  tasks.  Our  strategy  for  accomplishing  this  is  to  start 
simple  and  incrementally  add  complexity.  Our  first  simplification  is  to  assume 
that  the  teams  consist  of  homogeneous  agents,  i.e.,  they  are  all  executing  the 
same  task  program.  This  allows  us  to  focus  on  evolving  a  single  program  which, 
when simultaneously executed by a team of agents, produces collective cooperative
behavior, and it allows us to take advantage of existing work on evolving
single  agent  behaviors. 

However,  there  are  still  a  number  of  important  design  decisions  that  need 
to  be  made,  such  as  how  rule  sets  are  represented  internally  in  our  EA,  how 
rule  sets  are  modified  over  time,  etc.  We  discuss  these  design  decisions  in  the 
following  subsections. 

3.1  Representation 

In  our  architecture  an  individual  in  the  population  represents  a  complete  set  of 
rules,  and  its  representation  is  a  string  in  which  all  the  rules  are  concatenated. 
The  ordering  of  the  rules  is  not  important.  From  generation  to  generation,  the 
length  of  individuals  in  the  population  will  tend  to  vary  in  size.  The  system  has 
parameters  which  define  a  minimum  and  maximum  size  for  an  individual,  as  well 
as  an  initial  size. 

Each rule is a fixed length binary string. Rules are composed of a condition
clause and an action clause. The bits in the condition clause are mapped
to  the  agent’s  sensors,  while  the  bits  in  the  action  clause  are  mapped  to  the 
agent’s  actuators.  This  allows  each  agent  to  perceive  its  environment  and  take 
a  corresponding  action. 

The  rule  interpreter  used  by  each  agent  operates  as  follows.  In  any  given 
situation,  all  the  rules  are  compared  to  the  current  input  from  the  sensors,  and 
the  rule  that  has  the  highest  match  score  is  executed.  There  are  several  possible 
ways  of  doing  rule  matching.  For  simplicity  we  have  avoided  using  rule  weights 
and  bidding  techniques  such  as  in  SAMUEL  or  classifier  systems.  Rule  matching 
is  described  in  more  detail  in  the  description  of  the  agent  environment. 

3.2  Selection 

Our population management scheme is different from a typical GA. We have implemented
an ES-like model involving μ parents and λ offspring. Parent selection
is deterministic: all individuals produce the same fixed number of offspring.

The selection bias in our architecture is implemented using survival selection.
In an ES survivors are chosen in one of two ways: using a “+” strategy involving
both the parent and child populations, or using a “,” strategy involving only
the child population. The former converges more rapidly but is more likely to
find a local optimum, while the latter provides a broader but slower search. We
have both options implemented, and experimentally choose the one best suited
for the particular fitness landscape.

The ES community typically uses truncation selection for determining survivors.
We have chosen to use a binary tournament instead because the selection
pressure is weaker, allowing for more exploration early in the search.
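
The survival selection scheme described in this subsection might be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, tournaments are assumed to sample with replacement, and fitness is supplied as a callable.

```python
import random


def binary_tournament(pool, fitness, n_survivors, rng):
    """Fill the next parent population with winners of 2-way tournaments."""
    survivors = []
    for _ in range(n_survivors):
        a, b = rng.choice(pool), rng.choice(pool)  # sampled with replacement
        survivors.append(a if fitness(a) >= fitness(b) else b)
    return survivors


def select_survivors(parents, offspring, fitness, mu, strategy="+", rng=None):
    """Choose mu survivors using the '+' or ',' strategy."""
    rng = rng or random.Random()
    if strategy == "+":
        pool = parents + offspring   # parents compete with their children
    elif strategy == ",":
        pool = offspring             # only children can survive
    else:
        raise ValueError("strategy must be '+' or ','")
    return binary_tournament(pool, fitness, mu, rng)
```

Because each survivor only has to beat one randomly drawn rival, weak individuals occasionally survive, which is the weaker selection pressure the text refers to.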


3.3  Operators 

Since  the  internal  representation  is  binary,  we  use  a  standard  bit-flip  mutation 
operator.  We  also  implemented  both  a  1-point  and  a  2-point  crossover  operator. 

The 1-point crossover is the same operator as the one used by Wu et al.
[11].  Crossover  can  only  occur  on  rule  boundaries.  Because  individuals  can  vary 
in  size,  this  crossover  differs  from  the  standard  crossover  operator  used  in  most 
GAs.  Instead  of  selecting  crossover  points  at  the  same  location  on  both  parents, 
different  crossover  points  are  selected  for  each.  This  means  that  each  child  may 
contain  either  more  or  fewer  rules  than  the  parents  which  spawned  them.  In  fact, 
this  is  the  only  mechanism  by  which  rule  sets  can  change  in  size. 
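
A minimal sketch of the variable-length 1-point crossover described above, assuming individuals are stored as strings of concatenated fixed-length rules. The function name, the default rule length of 12, and the retry loop used to respect the size limits are assumptions for illustration.

```python
import random


def one_point_crossover(p1, p2, rule_len=12, min_rules=1, max_rules=200,
                        rng=None):
    """Cross two rule sets at independently chosen rule boundaries."""
    rng = rng or random.Random()
    n1, n2 = len(p1) // rule_len, len(p2) // rule_len
    while True:
        # A different boundary is picked in each parent, so children may
        # hold more or fewer rules than either parent.
        c1 = rng.randint(0, n1) * rule_len
        c2 = rng.randint(0, n2) * rule_len
        child1 = p1[:c1] + p2[c2:]
        child2 = p2[:c2] + p1[c1:]
        sizes = (len(child1) // rule_len, len(child2) // rule_len)
        if all(min_rules <= s <= max_rules for s in sizes):
            return child1, child2
```

Since cuts fall only on rule boundaries, every rule in a child is copied intact from one parent, which is why this operator produces no new rules.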

The  1-point  crossover  operator  does  relatively  little  mixing  of  parental  rules 
and  does  not  produce  any  new  rules.  Based  on  earlier  experience,  we  felt  it 
would  be  useful  to  have  a  more  disruptive  crossover  operator  available  as  well. 
We  chose  to  implement  a  2-point  crossover  operator  that  was  not  restricted  to 
crossing  on  rule  boundaries.  Crossover  points  are  chosen  by  first  picking  random 
rule  boundaries  in  both  parents,  just  as  with  the  1-point  crossover.  Then  a 
randomly chosen offset is applied to both crossover points to obtain the inter-rule
cut point. This is essentially the same crossover used in the GABIL system
[5].

3.4  Fitness 

The  fitness  of  a  particular  rule  set  is  obtained  via  simulation.  Agents  within  the 
simulation  use  the  rule  set  to  control  their  behaviors.  The  agents  have  a  task  to 
perform,  and  at  the  end  of  the  simulation  they  are  given  a  score  which  indicates 
how  well  the  task  was  performed.  Since  our  intention  is  to  have  these  agents 
cooperate,  they  are  all  evaluated  as  a  team,  and  all  receive  the  same  score.  In 
the  current  implementation,  all  agents  use  the  same  rule  set,  and  the  resulting 
score  from  the  simulator  is  used  as  the  fitness  of  that  rule  set. 

Without  any  sort  of  counteracting  force,  evolving  rule  sets  tend  to  grow 
uncontrollably,  very  much  the  way  Genetic  Programs  (GPs)  do  [10].  Parsimony 
pressure  is  used  to  discourage  this  growth  by  penalizing  the  fitness  of  larger 
individuals.  We  have  implemented  parsimony  pressure  with  the  same  approach 
used by Wu et al. [11], as described by the formula $f'(i) = f(i) - \alpha\, l_i\, f(i)$,
where $l_i$ denotes the length of individual $i$. The
interesting thing to note about this equation is that the penalty gets stronger
as the raw fitness increases. This approach allows individuals to grow larger
early in the process, perhaps improving the exploration phase of the search. The
particular value used for $\alpha$ is experimentally determined.
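
The parsimony penalty can be written as a one-line function. The sketch below assumes the penalty is the raw fitness scaled by the individual's length and a pressure coefficient; the function and parameter names are hypothetical.

```python
def penalized_fitness(raw_fitness, length, alpha):
    """Fitness after parsimony pressure: f' = f - alpha * length * f."""
    return raw_fitness - alpha * length * raw_fitness
```

With a zero coefficient the raw fitness is returned unchanged, and for a fixed length the absolute penalty grows in proportion to the raw fitness.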

4  Experimental Methods

4.1  Simulation

Currently all of our experiments involve the use of a simple micro air vehicle (MAV)
simulator.  The  simulation  environment  is  a  2-D  arena  surrounded  by  walls.  Nine 
identical  MAVs  are  placed  into  the  simulator,  and  are  allowed  to  move  and  turn 
on  each  timestep.  The  MAVs  are  like  helicopters  in  that  they  can  hover  or  move 
at  a  slow  constant  speed.  As  the  MAVs  move,  they  can  potentially  collide  with 
each  other  or  with  the  walls  surrounding  the  arena.  Any  MAV  that  is  involved 
in  a  collision  is  immediately  destroyed  and  removed  from  the  simulation. 

Each  MAV  has  8  sonar  sensors  placed  radially  around  the  vehicle.  These 
sensors  have  no  range  information.  They  return  either  a  0  or  a  1,  indicating 
whether  or  not  there  is  an  object  in  range  in  the  direction  the  sensor  is  pointing. 
The  sensor  range  can  be  adjusted  as  a  parameter  of  the  simulation. 

The robots also have a surveillance range. They can “look” down and observe
objects  on  the  ground.  Currently  the  MAVs  pay  no  attention  to  what  they  are 
observing.  Their  only  goal  is  to  observe  as  much  of  the  ground  as  possible  at  any 
given  time. 

The  behavior  of  the  MAVs  is  defined  by  a  set  of  stimulus-response  rules. 
Each  rule  is  made  up  of  12  bits,  and  contains  a  condition  and  an  action  section. 
The  first  8  bits  are  the  condition  section,  with  one  bit  for  each  sensor.  At  each 
timestep  the  current  sensor  readings  are  compared  with  all  the  conditions  in  the 
rule  set.  The  rule  with  the  closest  match  is  the  winner.  If  there  is  a  tie,  a  winner 
is  chosen  randomly  from  among  the  best  matches.  There  is  also  a  minimum 
threshold  for  matches.  At  least  half  of  the  condition  bits  must  match  the  current 
sensor  configuration.  If  the  winning  rule  exceeds  this  threshold,  its  action  is 
executed. 
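
The rule interpreter described above can be sketched as follows. This is an illustrative reading, not the simulator's code: rules and sensor readings are assumed to be bit strings, the function names are hypothetical, and the threshold is interpreted as "at least half of the 8 condition bits match".

```python
import random

COND_BITS = 8  # first 8 bits of each 12-bit rule form the condition


def match_score(condition, sensors):
    """Number of condition bits equal to the corresponding sensor bits."""
    return sum(c == s for c, s in zip(condition, sensors))


def select_rule(rules, sensors, threshold=COND_BITS // 2, rng=None):
    """Return the best-matching rule, or None if no rule reaches the threshold."""
    rng = rng or random.Random()
    if not rules:
        return None
    scores = [match_score(rule[:COND_BITS], sensors) for rule in rules]
    best = max(scores)
    if best < threshold:
        return None
    # Ties are broken by choosing randomly among the best matches.
    winners = [rule for rule, s in zip(rules, scores) if s == best]
    return rng.choice(winners)
```

The last four bits of the returned rule would then be decoded into the speed and turn angle described in the next paragraph.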

The  action  section  of  the  rule  consists  of  two  parts,  a  speed  and  a  turn  angle. 
The  speed  can  have  a  value  of  either  0  or  1,  where  0  indicates  that  the  plane  will 
not  move  in  the  current  timestep,  and  1  indicates  that  it  will.  The  second  part 
of  the  action  is  the  turn  angle.  This  indicates  the  number  of  degrees  the  plane 
will  turn  relative  to  its  current  heading. 

Figure 1 provides a more concrete picture of an MAV simulation via a series
of three snapshots from an example run involving a reasonably good set of evolved
rules. The goal in this case is for a team of nine MAVs, starting from an
initial configuration on the left edge of a surveillance area, to dynamically configure
itself (without collisions!) in such a way as to obtain maximal surveillance
coverage.

The  simulator  is  stochastic  in  that  the  results  of  a  simulation  using  the  same 
rule  set  can  change  from  run  to  run,  resulting  in  a  “noisy”  fitness  evaluation. 
Consequently,  we  typically  run  an  individual  through  several  trials.  We  then 
assign  the  average  of  all  the  trials  as  the  fitness  for  the  individual. 
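
Averaging over trials to smooth a noisy fitness evaluation might look like the sketch below. The `simulate` callable stands in for the MAV simulator and is an assumption, as are the function name and the default of five trials (the value used in the experiments reported later).

```python
from statistics import mean


def evaluate(rule_set, simulate, n_trials=5):
    """Fitness of a rule set: mean team score over repeated stochastic trials."""
    return mean(simulate(rule_set) for _ in range(n_trials))
```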

5  Initial Experimental Results

The goal of our initial experiments was to test our design decisions, tune our
system, and evaluate its ability to evolve effective rule sets for teams of homogeneous
agents. We describe these experiments in the following subsections.

The following parameters were used for our experiments unless stated otherwise.
We set both μ and λ to 100. The number of trials was 5, while the crossover
and  mutation  rates  were  1.0  and  0.001  respectively.  Individuals  were  limited  to 
between  1  and  200  rules,  with  an  initial  starting  size  of  5  rules. 

5.1  Population  Management 

Recall that we have both “+” and “,” population management strategies available
for use. The goal of our first set of experiments was to determine how
sensitive the results are to this design choice. In general (μ + λ) slightly outperformed
(μ, λ), although not by much. As a consequence we adopted the (μ + λ)
strategy for the remaining experiments.


5.2  Parsimony  Pressure 

Another  important  design  choice  is  the  amount  of  parsimony  pressure  used.  In 
general,  too  little  parsimony  pressure  allows  the  length  of  individuals  to  grow 
indefinitely,  and  too  much  parsimony  pressure  produces  compact  individuals 
with  suboptimal  fitness.  What  is  needed  is  a  pressure  point  in  between  these 
two  extremes.  The  goal  of  our  second  set  of  experiments  was  to  get  a  rough 
sense  of  how  parsimony  pressure  affected  our  system.  We  used  three  different 
values  for  parsimony  pressure  initially:  0,  1/2400  and  1/300.  Figures  2  and  3 
show  that  the  parsimony  pressure  does  work  as  expected.  Higher  parsimony 
pressures  tend  to  produce  smaller  individuals,  but  at  the  expense  of  fitness. 

Fig. 2. Length of the best individual averaged over 5 runs for three parsimony pressures.

Fig. 3. Best-so-far curves of raw fitness averaged over 5 runs for three parsimony pressures.


5.3  Crossover 

Recall that we have both a 1-point and a 2-point crossover operator implemented.
The experiments so far have used the default 1-point crossover operator. Our
third set of experiments involved testing the sensitivity of the results to these
operators. Since we felt there could be some interaction with parsimony pressure,
we tested sensitivity at a variety of pressure points: 0, 1/24000, 1/2400, 1/1200,
1/600 and 1/300. Five runs were performed for 200 generations at each parsimony
pressure value.

In figure 4 we plot the raw fitness after 200 generations versus the parsimony
pressure used. Again we see that higher parsimony pressures yield individuals
with lower raw fitness values. However, we also see that 2-point crossover outperforms
the 1-point crossover consistently at all levels of parsimony pressure.
Consequently, we made 2-point crossover the default.

5.4  Generalization 

Although  our  system  is  at  this  point  evolving  interesting  and  effective  rule  sets, 
figure  4  is  somewhat  disconcerting  in  that  fitness  declines  steadily  with  increasing 
parsimony  pressure.  Ideally,  one  would  hope  to  see  shorter  rule  sets  emerging 
with more general rules that achieve comparable performance. One possible explanation
for why we don’t see this is that the rule language itself is not well
suited for generalization.

To test this we added classifier-like wildcards to our system by allowing the
genes in the condition section of the rules to take on three values: 0, 1 and a wildcard.
We also modified the random initialization and mutation operators so that we
could adjust the number of wildcards in our individuals. We added a parameter
called “wildcard ratio” which allows us to adjust the bias for the number of
wildcards which end up in our rules. It can take a value between 0 and 1. A
value of 0.4, for example, would mean that on average 40% of the genes in the
condition sections of the rules will be wildcards.
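
A minimal sketch of how a wildcard ratio could bias random initialization, and how wildcards could enter the match score. The `#` symbol, the function names, and the choice to count a wildcard as a matching position are assumptions made for illustration; the text does not specify them.

```python
import random

WILDCARD = "#"  # assumed symbol for "matches either sensor value"


def random_condition(n_bits, wildcard_ratio, rng):
    """Random condition in which each gene is a wildcard with the given probability."""
    return "".join(
        WILDCARD if rng.random() < wildcard_ratio else rng.choice("01")
        for _ in range(n_bits)
    )


def match_score(condition, sensors):
    """Count positions that are wildcards or equal to the sensor bit."""
    return sum(c == WILDCARD or c == s for c, s in zip(condition, sensors))
```

With a ratio of 0 no wildcards are generated, which reproduces the original binary conditions.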

Fig. 4. Raw fitness vs. parsimony pressure is plotted for both the 1-point and 2-point
crossover operators.

Fig. 5. Length vs. wildcard ratio bias is plotted using 3 different parsimony pressures:
1/24000, 1/2400, 1/600.


We  ran  experiments  using  several  values  for  the  wildcard  ratio  and  parsimony 
pressure.  In  figure  5  we  plot  length  vs.  wildcard  ratio  for  the  three  different 
parsimony  pressures.  Each  point  on  the  graph  represents  the  best  individual  at 
the  end  of  200  generations,  and  is  an  average  of  five  runs. 

As one can see, adding wildcards to the system had little effect on our ability
to evolve smaller rule sets. A wildcard ratio of zero is equivalent to having
no wildcards. As we increase the wildcard ratio the evolved rule lengths are
basically unchanged regardless of the parsimony pressure. We see two possible
explanations for this. First, our rule matching approach differs from the one used
in classifier systems. We allow partial matches, and that acts as an alternative,
and perhaps competing, method of rule generalization. Another, more likely explanation
is that the problem domain we’ve chosen is just too simple. We believe
that wildcards would be more useful if given a more difficult problem.

6  Conclusions  and  Future  Work 

We  have  completed  our  initial  design  and  evaluation  of  an  EA  designed  to  evolve 
behavioral  rules  for  teams  of  cooperating  agents.  Building  on  the  work  done  for 
single  agent  systems,  we  were  able  to  relatively  quickly  make  and  test  design 
choices that resulted in the ability to evolve effective rule sets for teams of homogeneous
agents. We are now continuing to develop the system further in several
ways.  First,  we  believe  that  the  ability  to  evolve  shorter  and  more  general  rule 
sets  is  important.  Our  initial  experiments  with  wild  cards  were  not  successful. 
We  are  working  on  understanding  this  better. 

Our ultimate goal is to work toward evolving heterogeneous cooperating
agents. This initial work involving homogeneous agents provides a foundation for doing
so, but needs further development. Extending our system to include notions of
cooperative co-evolution seems quite appropriate here. We will be reporting on
this in the near future.

Abstract. We are becoming increasingly dependent on large interconnected
networks for the control of our resources. One important issue is
resource protection strategies in the event of failures and/or attacks. To
address this issue we investigated the effectiveness of evolving finite-state
machine (FSM) strategies for winning against an adversary in a challenging
Competition for Resources simulation. Although preliminary results
were promising, unproductive cyclic behavior lowered performance. We
then augmented evolution with an algorithm that rapidly detects and removes
this cyclic behavior, thereby improving performance dramatically.


1  Introduction 

We  are  becoming  increasingly  dependent  on  large  interconnected  networks  for 
the  control  of  our  resources,  such  as  the  Internet,  communications  networks,  and 
power  grids.  The  advantage  of  these  networks  is  the  ability  to  route  resources 
in  a  reasonably  optimal  fashion.  However,  their  interconnectivity,  coupled  with 
the  lack  of  global  view  of  what  is  happening  in  these  networks,  can  lead  to 
tremendous  problems  in  network  reliability.  For  example,  small  local  failures 
can  easily  propagate  to  entire  networks,  causing  loss  of  service  and  corruption  of 
data.  Also,  deliberate  attacks  (such  as  “denial  of  service”  [3]  attacks)  can  easily 
cause widespread havoc, as poignantly demonstrated recently [11].

Thus  one  important  issue  is  the  development  of  effective  network  traversal 
strategies  to  protect  as  many  resources  as  possible  from  failure  and/or  attacks, 
i.e.,  to  maximally  restrict  the  number  of  resources  damaged.  To  address  this 
issue  we  have  decided  to  create  a  “resource  protection”  simulation  that  captures 
the  essential  aspects  of  this  problem.  A  “defender”  attempts  to  protect  resources 
before  they  are  damaged  by  an  intentional  (or  unintentional)  “adversary”. 

Our  primary  goal  then  is  to  create  sophisticated  reactive  strategies  for  the 
defender.  We  use  finite-state  machines  (FSMs)  for  our  strategies,  since  there 
are  a  number  of  precedents  for  FSMs  being  effective  strategies  for  adversarial 
situations. We use evolutionary algorithms (EAs) to create the FSMs, since there
is ample evidence for the effectiveness of this approach [4,5].¹ This paper serves
to summarize and highlight the results we have obtained thus far.

¹ The evolution of FSMs is often referred to as “evolutionary programming” or “EP”.
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2  The  Competition  for  Resources  Problem 

Our  current  Competition  for  Resources  simulation  is  a  novel  two-player  game 
on  a  toroidal  board  of  squares.  Each  square  corresponds  to  a  resource,  and 
the  two  players  (the  “defender”  and  “adversary”)  compete  for  squares  on  the 
board. If the board is of size N × N, then the defender will start at square
(1,1) and the adversary will start at square (N,N). The remaining squares are
initially  unoccupied.  Since  the  board  grid  represents  real  networks,  such  as  power 
grids  or  communication  networks,  and  in  the  real  world  networks  may  be  highly 
interconnected  and  will  have  few  geophysical  boundaries,  our  board  is  toroidal 
(has  no  edges).  In  this  paper  we  assume  N  =  10,  which  is  quite  challenging. 

Each  player  can  only  perceive  limited  information,  namely,  the  status  of  the 
north,  south,  east,  and  west  squares  neighboring  the  current  position  of  the 
player.  The  diagonal  squares  can  not  be  seen.  The  status  of  each  neighboring 
square  will  be  one  of  the  following:  unoccupied,  occupied  by  that  player,  or 
occupied  by  the  opponent.  Due  to  the  toroidal  nature  of  the  board,  the  defender 
and  the  adversary  are  close  to  one  another  at  the  beginning  of  the  game.  However, 
because  they  can  not  see  along  diagonals,  they  can’t  see  one  another  initially. 

Each time step the players alternate taking an action, which consists of
moving to a neighboring resource to control/protect that resource. A player can
move to an unoccupied square or back to a square that it has previously
occupied, but not to a square occupied by the opponent. The player isn’t allowed
to “stand still” and make no move. However, because each player must follow a
path of “owned” resources to its current position, it will always be able to make a
move at every time step (it can always back up along the path it has taken). Thus
a player cannot be “trapped” at a square, i.e., it cannot be completely
surrounded by the opponent. Once an agent occupies a resource, it controls/protects
that resource forever. A game ends when all squares are occupied or time runs
out. The agent with the most resources at the end of the game wins.

Throughout  this  paper  the  adversary  will  have  a  fixed  stochastic  strategy 
that  the  defender  must  “learn”  to  defeat.  The  strategy  we  have  chosen  for  the 
adversary  is  simple,  but  is  surprisingly  hard  to  beat.  If  the  adversary  detects 
any  unoccupied  neighboring  squares,  it  uniformly  randomly  moves  to  one  of 
them.  Otherwise  it  uniformly  randomly  backtracks  to  a  neighboring  square  it  has 
previously  occupied.  Given  the  game  and  our  adversary,  we  focus  on  developing 
effective  strategies  for  the  defender. 
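
The board, the four-neighbor perception, and the adversary's rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the names `neighbors` and `adversary_move`, the 0-indexed coordinates, and the choice of row 0 as "north" are assumptions made here.

```python
import random

N = 10  # board size used in the paper
EMPTY, DEFENDER, ADVERSARY = 0, 1, 2
# (row, column) offsets for north, east, south, west; row 0 is "north" by assumption.
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def neighbors(pos, n=N):
    """Return the N, E, S, W neighbors of pos on an n x n torus."""
    r, c = pos
    return [((r + dr) % n, (c + dc) % n) for dr, dc in MOVES]

def adversary_move(board, pos, rng=random):
    """Fixed stochastic adversary: prefer unoccupied squares, else backtrack."""
    nbrs = neighbors(pos, len(board))
    free = [p for p in nbrs if board[p[0]][p[1]] == EMPTY]
    own = [p for p in nbrs if board[p[0]][p[1]] == ADVERSARY]
    target = rng.choice(free if free else own)
    board[target[0]][target[1]] = ADVERSARY
    return target

board = [[EMPTY] * N for _ in range(N)]
board[0][0] = DEFENDER           # the paper's square (1,1), 0-indexed here
board[N - 1][N - 1] = ADVERSARY  # the paper's square (N,N)
adv = adversary_move(board, (N - 1, N - 1))
```

Because the board wraps around, the two starting squares are diagonal neighbors, which is why neither player can see the other at the start of the game.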

3  Overview  of  Finite  State  Machines 

FSMs  can  be  effective  representations  of  agent  plans/strategies,  e.g.,  see  [1]  or 
[10]. The type of machine used here allows for indeterminate-length action
sequences. Recall from Hopcroft and Ullman [9] that the usual acceptance criterion
for  finite-length  strings  is  termination  in  a  “final”  state.  Here  we  assume  that 
there  are  no  final  states,  i.e.,  action  sequences  of  any  length  are  allowed.  This 
provides  a  good  model  of  embedded  agents  that  are  continually  responsive  to 
their  environment. 

Formally, we define the machine $M$ to be a six-tuple $(Q, \Sigma, \Delta, \delta, \lambda, q_1)$. $Q$ is
the set of vertices (states) of $M$, $\Sigma$ is the alphabet of input symbols (which are
agent sensory inputs), and $\Delta$ is the alphabet of output symbols (which are agent
actions). $\delta$ is the transition function from a state and an input to a next state,
i.e., $\delta(q_i, x_i) = q_{i+1}$, where $q_i, q_{i+1} \in Q$ and $x_i \in \Sigma$ is a sensory input. $\lambda$ is the
transition function from a state and an input to an output, i.e., $\lambda(q_i, x_i) = a_i$,
where $q_i \in Q$, $x_i \in \Sigma$, and $a_i \in \Delta$ is an action. The initial state is $q_1$.

We  assume  that  the  FSMs  are  deterministic  and  complete.  The  FSMs  are 
deterministic because $\delta$ and $\lambda$ are functions, i.e., for every state and input there
is a unique next state and action. The FSMs are complete because there exists
a next state and action for every state and input, i.e., $\delta$ and $\lambda$ are fully defined.
Deterministic  and  complete  FSMs  are  strategies  that  tell  the  agent  precisely 
what  to  do  in  every  situation  it  perceives. 

For an FSM strategy for the Competition for Resources simulation, the
sensory input $x_i$ shows the status of the neighboring resources immediately to the
north, east, south, and west of the agent. The status of each resource can be 0
(unoccupied), 1 (occupied by the defensive agent), or 2 (occupied by the
adversary). Thus an input of “2100” specifies that the north resource is owned by the
adversary, the east resource is owned by the defensive agent, and that the south
and west resources are unoccupied.
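
To make the definition concrete, the sketch below encodes a status string such as “2100” as a base-3 integer and performs one step of the machine. The class name `FSM`, the dictionary-based storage of the two functions, and the 0-indexed states are assumptions for illustration; the toy machine at the end ignores move legality.

```python
def input_index(status):
    """Map a 4-character status string (N, E, S, W order) to a base-3 integer."""
    index = 0
    for ch in status:
        index = index * 3 + int(ch)
    return index

class FSM:
    """Deterministic, complete machine with no final states."""

    def __init__(self, delta, lam, initial=0):
        self.delta = delta    # delta[(state, input_index)] -> next state
        self.lam = lam        # lam[(state, input_index)] -> action
        self.state = initial

    def step(self, status):
        """Read one sensory input, emit an action, and move to the next state."""
        x = input_index(status)
        action = self.lam[(self.state, x)]
        self.state = self.delta[(self.state, x)]
        return action

# Toy two-state machine: alternates between states and between two actions.
delta = {(q, x): 1 - q for q in (0, 1) for x in range(81)}
lam = {(q, x): ("south" if q == 0 else "west") for q in (0, 1) for x in range(81)}
fsm = FSM(delta, lam)
print(fsm.step("2100"), fsm.step("2100"))  # south west
```

Since both dictionaries are defined for every (state, input) pair, the machine is complete in the sense used above, and since each pair maps to exactly one value it is deterministic.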


4  Evolution  of  Finite  State  Machines 

In  an  EA  a  population  of  P  individual  structures  is  initialized  and  then  evolved 
from generation t to generation t + 1 by repeated applications of fitness
evaluation, selection, recombination, and mutation. In the context of the Competition
for  Resources  simulation,  each  individual  in  the  population  is  an  FSM.  Each 
FSM  is  evaluated  by  playing  the  game  numerous  times,  to  obtain  an  estimate 
of  how  well  that  FSM  is  defending  the  resources  against  the  adversary.  Those 
FSMs  that  perform  the  task  better  are  allowed  to  have  more  children,  which 
are  created  through  the  processes  of  mutation  and  recombination.  This  process 
continues  generation  by  generation,  until  termination. 

Representation.  For  efficiency  we  chose  a  simple  tabular  representation  for  the 
FSMs.  Rows  in  a  table  correspond  to  states,  and  columns  correspond  to  inputs. 
For each state $q_i$ and input $x_j$, table entry (i, j) has two elements. The first
element is the next state, i.e., it is $\delta(q_i, x_j)$. The second element is the action to
take given the agent is in this state $q_i$ and sees input $x_j$, i.e., it is $\lambda(q_i, x_j)$.

The number of states S is user defined. The maximum number of inputs
is $3^4 = 81$, since the status of each of the four neighboring squares may have
three values (0, 1, and 2). However, the input “2222” will never occur, since that
implies that the defender is surrounded by the adversary. This is impossible, since
the defender must have been able to get to the square it currently occupies. Thus
there are 80 possible inputs and we require a table of size S × 80. Each entry in
the table is an “allele” that represents a next-state/action pair. Since each allele
is defined uniquely, the FSM is guaranteed to be deterministic and complete.
The initial state is always state 1.
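
A possible in-memory layout for this S × 80 table is sketched below. The names `INPUTS`, `COLUMN`, and `random_table` are hypothetical, rows are 0-indexed (so the paper's state 1 is row 0), and the actions are random placeholders that do not yet respect move legality.

```python
from itertools import product
import random

S = 10  # number of states (user defined)

# Every status string over {0, 1, 2} in N, E, S, W order, minus the impossible "2222".
INPUTS = ["".join(p) for p in product("012", repeat=4) if "".join(p) != "2222"]
COLUMN = {status: j for j, status in enumerate(INPUTS)}  # status string -> column

def random_table(num_states, rng):
    """Build a num_states x 80 table of (next_state, action) alleles."""
    return [[(rng.randrange(num_states), rng.choice("NESW")) for _ in INPUTS]
            for _ in range(num_states)]

table = random_table(S, random.Random(0))
next_state, action = table[0][COLUMN["2100"]]  # lookup for state 1, input "2100"
```

Each cell holds exactly one allele, so every (state, input) pair has a single next state and a single action, which is the determinism and completeness property stated above.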

Initialization. Throughout this paper a population size of P = 100 is
assumed, since it produced good results. Each of the P FSMs at generation zero is
initialized  by  using  domain-specific  knowledge.  For  any  given  state  (row)  and 
input  (column)  the  next  state  is  chosen  uniformly  randomly  from  the  set  of  all 
S  states.  However,  the  choice  of  action  is  somewhat  more  complex.  The  number 
of  possible  actions  is  maximally  four,  since  the  defender  may  potentially  move 
north,  east,  south,  or  west.  However,  in  practice,  some  of  these  moves  might  be 
impossible,  if  the  adversary  owns  the  neighboring  squares. 

For  example,  suppose  again  that  the  input  is  “2100”.  In  this  case  the  north 
resource  is  owned  by  the  adversary,  and  there  are  only  three  legal  moves:  east, 
south,  and  west.  Moving  north  is  illegal,  since  the  adversary  owns  that  square. 
Actions  are  restricted  to  those  that  are  legal,  and  every  input  has  a  set  of  legal 
moves that are possible. However, since the goal of the game is to capture
resources, we also found it useful to define preferred moves: those that capture
previously unoccupied squares. When the input is “2100”, moves to the south or
west capture new territory and are thus preferable. During initialization actions
are always chosen uniformly randomly from the set of preferred actions, if there
are any. If there are no preferred actions, then a legal action is randomly chosen.²
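
The distinction between legal and preferred moves can be read directly off the status string. The helper names below (`legal_moves`, `preferred_moves`, `initial_action`) are illustrative assumptions, not the authors' code.

```python
import random

DIRECTIONS = ("north", "east", "south", "west")

def legal_moves(status):
    """Moves not blocked by the adversary (status digit 2)."""
    return [d for d, s in zip(DIRECTIONS, status) if s != "2"]

def preferred_moves(status):
    """Moves that capture a previously unoccupied square (status digit 0)."""
    return [d for d, s in zip(DIRECTIONS, status) if s == "0"]

def initial_action(status, rng=random):
    """Choose among preferred moves if any exist, otherwise among legal moves."""
    candidates = preferred_moves(status) or legal_moves(status)
    return rng.choice(candidates)

print(legal_moves("2100"))      # ['east', 'south', 'west']
print(preferred_moves("2100"))  # ['south', 'west']
```

For the input “2100” this reproduces the example in the text: three legal moves, of which south and west are preferred.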

Adapting the Number of States. Standard methods for evolving FSMs adapt
the number of states [4]. State adaptation raises a number of issues. When a state
is deleted, should one really erase the state from the table, or should it simply
be made inaccessible? Our prior experience in similar areas has shown that it
is often best to make the information inaccessible [2]. Since information has
been learned, keeping the information stored serves as a useful memory, which
can be re-activated at a later time (if the state is added back to the FSM).
Thus we added a “tag bit” to each row of the FSM table. If the tag is 1 the
state is accessible. If the tag is 0 the state is inaccessible, but is not destroyed.
When a state is added, it is accomplished simply by turning on the tag. Tag
bits are subject to an independent mutation operation that flips the tags with
probability 0.001. Since state 1 is always the initial state, it cannot be made
inaccessible.

Once a state $q_j$ has been made inaccessible, how should the remainder of the
FSM (that points to that state) be “repaired”? We investigated two solutions:
(1) if state $q_i$ points to $q_j$, change the pointer so that it points back to $q_i$, and
(2) change the pointer to point to any state that is accessible, chosen uniformly
randomly. The latter solution performed better.
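
The tag-bit mechanism and the second (better performing) repair rule might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, states are 0-indexed so the always-accessible initial state is index 0, and every row is repaired, including currently inaccessible ones, which the paper does not specify.

```python
import random

def mutate_tags(tags, rate=0.001, rng=random):
    """Flip each accessibility tag with probability `rate`; state 0 always stays on."""
    return [1 if i == 0 else (1 - t if rng.random() < rate else t)
            for i, t in enumerate(tags)]

def repair(table, tags, rng=random):
    """Redirect transitions into inaccessible states to a random accessible state."""
    accessible = [q for q, t in enumerate(tags) if t == 1]
    for row in table:
        for j, (nxt, action) in enumerate(row):
            if tags[nxt] == 0:
                row[j] = (rng.choice(accessible), action)  # action is left untouched
    return table
```

Note that turning a tag off never erases the corresponding row, so the stored alleles are available again if a later mutation turns the tag back on.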

² The emphasis on preferred actions enormously helps the initial search of the EA.

Adapting FSM Table Entries. Adaptation of the FSM table entries is
accomplished with mutation and recombination. Mutation is reasonably
straightforward. Each allele (next-state/action pair) in the FSM is chosen with
probability $p_m$. Once an allele is chosen a coin is flipped to see whether the action
or the next state is mutated. With probability $p$ the next state is mutated by
uniformly randomly choosing a state from the set of all accessible states. With
probability $1 - p$ the action is mutated. If there are any preferred actions the
algorithm uniformly randomly chooses one of those. If there are no preferred
actions the algorithm uniformly chooses any legal action. Note that this could
result in no change (e.g., if there is only one legal action). Experiments indicated
that performance was remarkably insensitive to $p_m$ and we use $p_m = 0.001/S$
throughout this paper. Setting $p$ to 0.5 worked well.
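
A compact version of this mutation operator is sketched below. The function and parameter names are assumptions, `inputs` is taken to be the list of status strings labeling the table columns, and `accessible` is the list of currently accessible state indices.

```python
import random

DIRECTIONS = ("north", "east", "south", "west")

def candidate_actions(status):
    """Preferred actions (onto unoccupied squares) if any, otherwise legal actions."""
    preferred = [d for d, s in zip(DIRECTIONS, status) if s == "0"]
    legal = [d for d, s in zip(DIRECTIONS, status) if s != "2"]
    return preferred or legal

def mutate(table, inputs, accessible, p_m, p=0.5, rng=random):
    """Pick each allele with probability p_m; mutate its next state (prob. p) or action."""
    for row in table:
        for j, status in enumerate(inputs):
            if rng.random() >= p_m:
                continue  # allele not chosen for mutation
            nxt, action = row[j]
            if rng.random() < p:
                nxt = rng.choice(accessible)
            else:
                action = rng.choice(candidate_actions(status))
            row[j] = (nxt, action)
    return table
```

As the text notes, a chosen allele may come out unchanged, for instance when the new random draw equals the old value or only one legal action exists.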

We also use $P_0$ uniform recombination [12]. A proportion $p_r$ of pairs of parents
in the population are chosen for recombination. For each pair of parents, a coin
is flipped for each of the S × 80 alleles. The allele at the table location (i, j)
in the first FSM is swapped with the corresponding allele in the second FSM,
with probability $P_0$. If alleles are swapped, both the next state and the action
are swapped. Since only corresponding alleles are swapped, there is no need to
worry about possible illegal actions. If an action is legal for one FSM at location
(i, j) it must be legal for any other FSM at location (i, j), since the input j is the
same. Since parents may have different sets of accessible states, recombination
may swap alleles in such a fashion that a next state that was accessible in one
child FSM is now inaccessible in the other FSM. In this situation a new next
state is chosen uniformly randomly from the set of accessible states (in that
FSM). Experiments indicated that performance was very sensitive to $p_r$ and $P_0$.
Using recombination to its fullest extent ($p_r = 1.0$ and $P_0 = 0.5$) worked best.
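
The allele-swapping step of $P_0$ uniform recombination can be sketched as follows. The function name is an assumption, and the sketch deliberately omits the follow-up repair of next states that become inaccessible in a child.

```python
import random

def uniform_recombination(parent_a, parent_b, p0=0.5, rng=random):
    """Swap corresponding (next_state, action) alleles with probability p0."""
    child_a = [row[:] for row in parent_a]  # copy rows so the parents are unchanged
    child_b = [row[:] for row in parent_b]
    for i in range(len(child_a)):
        for j in range(len(child_a[i])):
            if rng.random() < p0:
                child_a[i][j], child_b[i][j] = child_b[i][j], child_a[i][j]
    return child_a, child_b
```

Because an allele only ever moves between the same (i, j) cell of two tables, the action it carries remains legal for the input of that column, as argued in the text.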

Fitness Evaluation. Since the adversary in the Competition for Resources
game  is  stochastic,  each  defender  FSM  will  have  to  play  the  game  multiple  times 
in  order  to  obtain  an  estimate  of  how  well  it  defends  the  resources.  Recall  that 
the  player  with  the  most  resources  at  the  end  of  the  game  wins.  In  case  of  a  tie, 
the  adversary  wins.  Given  G  games,  the  fitness  of  a  defender  FSM  is  the  fraction 
of  games  that  it  wins.  This  fitness  function  returns  values  from  0.0  to  1.0,  with 
1.0  representing  an  FSM  that  won  all  the  games  it  played.  Setting  G  properly 
proved  to  be  difficult.  Prior  work  [7]  concluded  that  the  overall  efficiency  of  the 
EA  may  often  be  improved  by  reducing  G  and  by  running  for  more  generations. 
This  did  not  work  for  us.  A  low  value  of  G  resulted  in  unacceptable  sampling 
error  and  a  high  value  was  too  CPU  intensive.  We  were  unable  to  balance  these 
constraints  with  an  intermediate  value. 

To  solve  this  difficulty  we  took  a  two-phase  approach.  Initially,  we  use  a  low 
value  of  G,  so  that  each  individual  can  get  a  quick  evaluation.  If  that  individual  is 
promising  (it  did  better  than  the  best  individual  seen  thus  far),  it  is  re-evaluated 
using  a  high  value  of  G.  If  it  still  beats  the  best  individual  thus  far,  it  becomes  the 
new  best  individual.  The  idea  was  to  carefully  evaluate  only  those  individuals 
that appeared promising. This approach worked quite well. We used a value
of G = 500 for the initial evaluation and G = 10,000 for the subsequent
re-evaluation (if it was performed). Since most individuals were unable to beat the
best individual seen thus far, they were not re-evaluated.
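
The two-phase scheme can be expressed as a short function. Here `play_game` is a hypothetical callable that plays one game and returns True when the defender wins (ties count as losses), and returning the careful estimate as the individual's fitness is an assumption; the paper does not say which estimate is kept.

```python
def two_phase_fitness(play_game, fsm, best_so_far,
                      quick_games=500, careful_games=10_000):
    """Return (fitness, is_new_best) from a cheap estimate plus a careful re-check."""
    def win_fraction(games):
        # Fraction of games won by the defender FSM.
        return sum(play_game(fsm) for _ in range(games)) / games

    fitness = win_fraction(quick_games)
    if fitness <= best_so_far:
        return fitness, False               # not promising: skip the expensive phase
    fitness = win_fraction(careful_games)   # re-evaluate promising individuals only
    return fitness, fitness > best_so_far
```

Most individuals exit after the first 500 games, so the 10,000-game evaluation is paid only for the few candidates that appear to beat the current best.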

Selection  and  Termination.  We  use  standard  fitness-proportional  selection 
[8]  with  elitism  (i.e.,  the  population  contains  a  copy  of  the  best  individual  that 
has  ever  been  seen).  For  a  termination  criterion  we  ran  the  EA  for  a  user-defined 
number  of  generations  (2500). 
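
Fitness-proportional selection with elitism could be implemented along these lines. How the elite copy enters the population is not detailed in the paper, so reserving one slot for it, as done here, is an assumption, as is the function name.

```python
import random

def next_generation_parents(population, fitnesses, best_ever, rng=random):
    """Roulette-wheel sampling of parents, with one slot reserved for the elite."""
    slots = len(population) - 1
    if sum(fitnesses) == 0:
        # All weights zero: fall back to uniform sampling.
        chosen = [rng.choice(population) for _ in range(slots)]
    else:
        chosen = rng.choices(population, weights=fitnesses, k=slots)
    return [best_ever] + chosen
```

With fitness defined as the fraction of games won, an individual that wins twice as often is sampled twice as often on average.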

5 Experimental Evaluation

We  performed  two  experiments  to  judge  the  efficacy  of  our  method.  We  were 
interested in answering two questions. First, how many states should be
accessible initially? Second, does the adaptive-state EA find the optimal range of
accessible  states?  To  address  the  first  question  we  ran  an  experiment  where  each 
FSM  individual  is  initialized  with  S  =  10  states.  The  experiment  consisted  of 
a  comparison  between  the  adaptive-state  EA  in  three  configurations:  one,  five, 
and  ten  initially  accessible  states.  The  only  mechanism  for  adapting  the  number 
of  accessible  states  is  via  the  independent  mutation  operation  mentioned  above, 
which  flips  the  accessibility  tags.  Although  we  have  no  “penalty”  function  per  se 
(that  would  penalize  the  FSMs  for  having  more  accessible  states),  the  mutation 
operator  provides  a  slight  bias  towards  having  S/2  accessible  states. 
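
The bias towards S/2 can be seen from a short calculation that ignores selection (an idealization introduced here, not an analysis from the paper). Let $\mu = 0.001$ be the flip probability and $\pi_t$ the probability that a given tag is on at generation $t$:

```latex
\begin{align*}
\pi_{t+1} &= (1-\mu)\,\pi_t + \mu\,(1-\pi_t) = (1-2\mu)\,\pi_t + \mu, \\
\pi_{t+1} - \tfrac{1}{2} &= (1-2\mu)\left(\pi_t - \tfrac{1}{2}\right)
\quad\Longrightarrow\quad
\pi_t - \tfrac{1}{2} = (1-2\mu)^{t}\left(\pi_0 - \tfrac{1}{2}\right) \to 0 .
\end{align*}
```

Each mutable tag therefore drifts towards being on half of the time, so the expected number of accessible states drifts towards roughly S/2 (slightly more, since the tag of state 1 is fixed at 1). With $\mu = 0.001$ the drift is slow, which is consistent with calling the bias slight.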

Fig. 1. “Best-so-far” curves for the adaptive-state EA with one, five and ten initially
accessible states.

Figure 1 shows the best-so-far curves (the fitness of the best individual seen
thus  far)  for  the  adaptive-state  EA  initialized  with  one,  five  and  ten  accessible 
states.  The  log  plot  emphasizes  early  behavior.  Results  are  averaged  over  ten 
independent  runs  per  configuration.  On  average  the  adaptive-state  EA  changed 
the  number  of  accessible  states  in  the  FSMs,  from  one  to  five,  five  to  seven,  and 
ten to nine, respectively. One can see that having fewer initially accessible states
helps early performance.³ This is intuitively reasonable, since the adaptive-state
EA is initially searching a smaller space in this situation. However, having too few
initially accessible states (e.g., one) hurts later performance. With five and ten
initially accessible states performance was quite reasonable, with a final fitness
of 0.898 and 0.899 respectively. These results indicate that if the best number of
states for solving a problem is not known a priori it may be best to err on the
side of having too many, rather than too few.

³ The difference between one and five initially accessible states is statistically
significant (p < 0.04) everywhere except between 7,000 and 18,000 evaluations. The
difference between five and ten initially accessible states is significant (p < 0.04)
between 2,000 and 19,000 evaluations. The data may not be normally distributed;
hence we used an exact Wilcoxon rank-sum test with paired data in this paper.

The  second  question  (whether  the  adaptive-state  EA  finds  the  optimal  range 
of  accessible  states)  is  also  important,  although  we  have  not  seen  it  addressed  in 
the  literature.  To  address  it  we  ran  a  control  (ablation)  experiment,  where  we 
turned  off  the  adaptation  of  the  number  of  states.  Instead,  the  EA  was  run  with 
a  fixed  number  of  states  S.  There  were  ten  configurations  ( S  ranged  from  one 
to  ten)  and  ten  independent  runs  per  configuration.  The  results  are  shown  in 
Table  1.  Two  points  are  clear.  The  first  is  that  for  best  performance  this  problem 
requires  FSMs  with  at  least  three  states  (i.e.,  state  information  is  useful).  The 
second  is  that  performance  is  fairly  comparable  in  the  range  of  three  to  ten 
states.⁴ This agrees with the previous experiment (with the adaptive number of
states),  which  always  ended  in  the  range  of  five  to  nine  accessible  states. 


Table 1. The final fitness of the best individuals for each configuration, averaged over
ten runs per configuration. The optimum number of states is between three and ten.

Fixed Number of States   1      2      3      4      5      6      7      8      9      10
Fitness                  0.806  0.867  0.893  0.888  0.901  0.903  0.899  0.905  0.896  0.883

In  summary,  the  adaptive-state  EA  effectively  converges  to  the  optimal  range 
of  states.  Furthermore,  its  performance  is  competitive  with  the  best  fixed-state 
results  (when  started  with  five  or  ten  initially  accessible  states).  The  only  cause 
for  concern  is  that  although  the  adaptive-state  EA  that  started  with  one  initially 
accessible  state  ended  with  roughly  five  accessible  states,  the  end  performance 
was  much  poorer  (0.841)  than  the  fixed-state  results  (0.901).  This  suggests  that 
although  states  are  being  made  accessible  the  FSM  is  not  taking  full  advantage 
of  them.  We  investigate  this  possibility  further  in  Section  7. 

6  External  Behavior  of  the  Evolved  FSMs 

In  order  to  understand  and  improve  our  results,  we  watched  the  agents  play 
the  simulation.  The  most  noticeable  feature  was  unproductive  cycling  behavior 
by  the  defender.  In  other  words,  the  defender  repeatedly  visits  a  small  set  of 
squares  on  the  board,  while  the  adversary  continues  capturing  new  squares. 
Unfortunately,  evolution  alone  can  not  solve  this  problem  because  cycles  are 
inherent to FSMs. Therefore we augmented the FSMs with an auxiliary memory
and an algorithm to use this memory to detect and eliminate cycles (up to
a user-defined maximum length). To detect cycles, we use behavior checking
(earlier results with model checking are described in [13]). Behavior checking
examines the dynamic run-time behavior of the agent. Run-time checking of
system behavior is a very new topic in the verification community, but some
of the results already appear promising (e.g., [6]). Here, we present the first
algorithm of which we are aware that does a run-time check for an FSM agent’s
cyclic behavior.

⁴ The increase in performance from one to two states and two to three states is
significant (p < 0.003 and p < 0.04, resp.) whereas the other differences are not
significant.


Fig. 2. A comparison of performance with and without cycle checking and repair
(plot title: “Adaptive Number of States with Repair”).


Our  behavior  checking  algorithm  is  executed  while  the  agents  play  a  game. 
For a sliding window of t time steps, the defender saves its current state and
location on the board (the defender now consists of an FSM and auxiliary memory).
The defender uses this auxiliary memory to make a cycle check before every
move. If all four immediate neighbors are occupied, then the defender checks
whether its current state and location are equal to any other in its window. If
yes, a cycle has been identified and a random alternative action is taken to the
one recommended by the FSM.⁵ We have found that a window size of 2 × N = 20
time steps, which identifies cycles up to length 20, works well.
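
The run-time check can be sketched with a bounded queue of recent (state, location) pairs. The class and function names are hypothetical, and `neighbor_status` is assumed to be the four-character status string the defender already perceives.

```python
import random
from collections import deque

class CycleChecker:
    """Sliding window of the defender's recent (FSM state, board position) pairs."""

    def __init__(self, window=20):
        self.history = deque(maxlen=window)  # oldest entries fall out automatically

    def cycle_detected(self, state, position, neighbor_status):
        """Call before each move; True if (state, position) repeats within the window."""
        key = (state, position)
        all_occupied = "0" not in neighbor_status  # check only when nothing is free
        repeated = all_occupied and key in self.history
        self.history.append(key)
        return repeated

def alternative_action(recommended, legal, rng=random):
    """Pick a random legal action other than the FSM's recommendation, if one exists."""
    others = [a for a in legal if a != recommended]
    return rng.choice(others) if others else recommended
```

With `window=20` the current pair is compared against the previous 20 time steps, so repeats that recur within 20 steps are caught, matching the window size reported above.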

To  test  the  hypothesis  that  behavior  checking  and  cycle  repair  will  improve 
performance, we reran the adaptive-state EA experiment with five and ten
initially accessible states, but added in cycle detection and repair. This hypothesis
is confirmed, as shown in Figure 2.⁶ Our best performing defender FSM with
repair wins 96% of the games!


⁵ Of course, this alternative action could also create a cycle, but the behavior checking
algorithm will immediately detect that cycle, after that move.

⁶ The improvement using detection and repair is statistically significant, p < 0.01.

7 Internal Behavior of the Evolved FSMs

Although we achieved excellent performance with the addition of cycle
detection and repair, we were still concerned that the adaptive-state EA might not be
making good use of newly accessible states. To investigate this further we
performed a dynamic internal analysis of the FSMs as they were executed by the
defender. While the FSM was executing we counted the number of times that
each of the n ≤ 10 accessible states was actually the next state of a transition.
The results for the fixed-state experiment were reassuring: the FSM tended to
make reasonably uniform use of all n states. However, this was not true for the
adaptive-state experiment. As stated earlier, on average the adaptive-state EA
changed the number of accessible states in the FSMs from one to five, five to
seven, and ten to nine, respectively. However, the internal analysis shows that
only two, four, and six states (respectively) are actually being used to any
appreciable degree. This explains the poor performance of the configuration where
only  one  state  is  initially  accessible.  Although  five  states  became  accessible,  only 
two  were  actually  being  used.  Clearly  the  adaptive-state  EA  is  having  difficulty 
making  full  use  of  newly  accessible  states  that  have  never  been  seen  before. 
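
The internal analysis amounts to tallying which states are entered while the FSM runs. The helper below is a hypothetical illustration; in particular, the paper gives no numeric threshold for "appreciable" use, so none is assumed here.

```python
from collections import Counter

def state_usage(visited_next_states, accessible):
    """Fraction of recorded transitions that entered each accessible state."""
    counts = Counter(visited_next_states)
    total = sum(counts.values())
    if total == 0:
        return {q: 0.0 for q in accessible}
    return {q: counts[q] / total for q in accessible}

# Example: three accessible states, but the third is never entered.
usage = state_usage([0, 1, 0, 0, 1, 0, 0, 0, 0, 0], accessible=[0, 1, 2])
```

A state can be accessible (its tag is on) and still show a usage of zero, which is exactly the gap between accessible and actually used states reported above.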

This  raises  the  serious  concern  that  the  addition  and  deletion  of  states  are 
such  disruptive  operations  that  they  cause  noticeable  problems  for  the  evolution 
of  FSMs.  Currently  we  are  investigating  the  application  of  “gentler”  operators 
that  could  perform  the  same  role.  Simply  deleting  states  (or  turning  them  off)  is 
too  disruptive,  due  to  the  repair  that  must  be  performed  afterwards.  However, 
merging  two  similar  states  could  remove  a  state  in  a  fashion  less  deleterious 
to  evolution.  This  process  would  be  analogous  to  generalization.  Similarly,  as 
opposed  to  adding  states  (or  turning  them  on),  an  alternative  operator  would 
clone  an  existing  row  of  the  tabular  representation.  Accessing  this  new  state 
would  not  be  deleterious  and  evolution  could  proceed  to  modify  it  slowly.  This 
provides  a  process  of  specialization.  We  are  currently  exploring  these  options. 

8  Summary  and  Future  Work 

To  summarize,  this  paper  has  empirically  explored  issues  related  to  evolving 
FSMs  in  the  context  of  the  Competition  for  Resources  problem.  Our  experiments 
yielded  some  interesting  and  useful  results.  For  example,  given  enough  initially 
accessible  states,  it  was  encouraging  to  find  that  the  adaptive-state  EA  was  able 
to  successfully  converge  to  the  optimal  range  of  number  of  states  and  was  able  to 
provide  good  performance.  However,  problems  arise  when  starting  with  too  few 
initially  accessible  states,  and  an  analysis  indicates  that  (for  the  Competition 
for  Resources  problem  at  least)  the  adaptive-state  EA  is  having  some  difficulty 
making  good  use  of  newly  accessible  states.  We  also  found  that  the  ubiquitous 
presence  of  cycles  hampered  the  defender’s  performance  significantly.  This  latter 
difficulty  was  greatly  diminished  by  augmenting  the  FSMs  with  memory  and  an 
algorithm  for  cycle  detection  and  repair. 

Our  main  focus  for  the  future  is  to  improve  the  evolution  of  the  FSMs, 
to  make  the  Competition  for  Resources  game  more  realistic,  and  to  continue 
our empirical investigations in the context of newer versions of the game. For
example,  in  the  current  game  resources  are  all  treated  equally.  In  the  spirit 
of  game  theory,  we  would  like  to  consider  resources  having  different  numeric 
values,  and  perhaps  have  the  value  of  a  resource  differ  for  each  of  the  agents. 
Another  possibility  is  to  allow  one  agent  to  (with  some  small  probability)  “steal” 
a  resource  owned  by  the  other  agent.  Another  possibility  is  to  include  multiple 
agents  and  co-evolution.  What  is  most  interesting  about  this  game  is  how  easily 
it  can  be  changed  to  represent  a  wide  variety  of  problems.  For  example,  with 
minor  modifications  we  have  extended  the  game  to  represent  the  epidemiology 
of  virus  versus  anti-virus  spread.  In  the  virus  version  of  the  game,  each  square 
represents  an  agent  with  the  virus,  anti-virus,  or  neither.  At  each  time  step, 
an  agent  having  the  virus  or  anti-virus  can  spread  it  to  one  of  its  neighbors. 
What  one  sees  on  the  board  when  watching  this  version  of  the  game  looks  like  a 
“spreading activation”. Further pursuit of the virus version, both in simulation
and in a corresponding mathematical model, is currently in progress.

Abstract. This paper discusses a method of translating human activities
into a program. The final goal of this research is to develop an
automatic programming system that is easy to use. A new modeling
scheme is introduced that allows human-like representation and
replaces the person with the computer as the subject of programming.
A method is proposed for translating rules described in this modeling
scheme into a program specification. By using the domains defined for
the variables of the rules, the optimum program specification can be generated.


1  Introduction 

The goal of this paper is to discuss a part of the research conducted by the author’s
group. The final goal of this research is to develop a way of automatic
programming. Programming is a special kind of human activity. Therefore the concept of
activity used in this paper is discussed first. Every activity has an outcome and
is executed by some subject. The subject can be a person, another creature,
or a machine. Each subject has its own language and tools for representing and
carrying out an activity. Accordingly, every subject has its own way of carrying
out an activity. A subject’s activity is also affected by the environment in which
the subject is placed. Therefore different subjects in different environments can
perform different activities with the same outcome. In order for an observer to
describe another’s activity correctly, the representation must include the subject
of the activity and its environment.

When the subject is a computer, a formal description of its activity is a
computer program. Programming is an activity that formally represents, at a higher
level, an activity of a computer. The subject of this activity (programming) has
mostly been a human being. To automate programming is to replace the person
with a computer as the subject of this activity. In order to achieve this goal, it is
necessary to represent formally the activity of programming whose subject is a
computer. Thus the objective of this paper is to describe a way of representing a
goal-directed human activity and translating it into an activity of a computer,
i.e., a program. This goal is divided into two sub-goals: creating a representation
of human activity, and translating it into the computer’s activity. The range of
representations of human activity is very wide. Among them, the ordinary program
specification is a representation so close to a computer program that an almost
formal transformation into a program is possible. This form is far from a natural
representation of human activity, and the writer is required to take programming
technique into account in the representation. The objective of this research is
therefore to allow people a human-like representation of human activity, and to
discuss a computer technology that can translate it into the computer’s activity.
Within this vast area, this paper discusses a method of translating human activity
into program code. It is assumed that a description of the human activity has
already been given; producing that description is a separate problem.


2 New Modeling Scheme: Model Representation
Including Subjects and Objects

Every significant computer activity has some object to which the activity
applies. The scope of consideration is therefore limited to this class; that
is, an activity without any object is not considered. A new modeling scheme is
developed to represent every activity together with the related objects that
are involved in achieving the goal. It includes two kinds of structure: an
object model structure and a subject model structure. The former is a structure
of the objects involved in an activity to be executed by some subject. Objects
are organized in a model structure by means of a finite set of structural
relations. Typical examples are the is-a relation and the part-of relation.
These alone are not enough to represent human activity in a form natural to
human beings; other relations are necessary. For example, it is difficult to
represent a power set with these structural relations, yet the power set
concept is sometimes important for representing human ideas and, accordingly,
human activity. Cartesian products, lists, and graphs are also necessary. Thus
a minimal set of structural relations must be defined to represent these
structures. An object structure is formed from these basic structural
relations. Usually it is a hierarchy built on the is-a relation, the part-of
relation, or both, with graphs within a single level of the hierarchy.

Subjects are also organized in a model structure by means of a structural
relation. In reality there is no substantial structural relation among subjects
as there is among objects; rather, subjects exist independently of each other.
Every subject has its own activity, which is represented by a predicate that
includes the subject as a term. Relations exist between activities. An activity
A may depend on another activity B in the sense that A uses the outcome of B.
In this case the activities, and therefore their subjects, are arranged in a
hierarchy such that B is placed under A. When activities A and B depend on each
other, they form a recursive relation, and the activities and their subjects
are placed on the same level. Thus subjects are organized in a structure via
the relations among their activities.

There are two types of activities. First-class activities are those arranged
at the top or in the middle of the subject model hierarchy. Every activity in
this class depends on some other activity. Second-class activities are those
arranged at the leaves of the subject model hierarchy and do not depend on any
other activity. Every such activity can be executed independently of the other
activities, referring only to the object structure. Thus the global goal of the
system can be achieved by executing the activities from the bottom up. This is
how a human executes the activities. Automatic programming translates this
structure of activities into a program. In this case the order in which the
computer executes the activities must be decided. The first-class activities
concern the control structure of the produced program. To decide this control
structure, a top-down interpretation of the activity structure is necessary.
Each activity at a leaf represents a program unit that is fitted into the
control structure. This unit program is not always very small. To make up the
whole program, the automatic programming of the unit programs must be assured.
Thus automatic programming consists of two stages: automatic programming of
unit programs and development of the control structure. In the following, the
automatic programming of a unit program is discussed.
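
As an illustration of the bottom-up execution described above, the following
Python sketch orders a small activity hierarchy so that every activity runs
after the activities whose outcome it uses. The activity names and the
dictionary layout are hypothetical, chosen only for this example. Mutually
dependent (recursive) activities are not handled here and would raise a
CycleError.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: each key maps to the activities whose outcome it uses.
depends_on = {
    "plan_trip": {"book_flight", "book_hotel"},
    "book_flight": {"check_budget"},
    "book_hotel": {"check_budget"},
    "check_budget": set(),
}


def classify(depends_on):
    """Split activities into those with dependencies and those without (leaves)."""
    first_class = {a for a, deps in depends_on.items() if deps}
    second_class = {a for a, deps in depends_on.items() if not deps}
    return first_class, second_class


def bottom_up_order(depends_on):
    """Return an order in which each activity follows the activities it uses."""
    return list(TopologicalSorter(depends_on).static_order())


first_class, second_class = classify(depends_on)
order = bottom_up_order(depends_on)
```

The leaves come first in the resulting order and the top activity comes last,
which corresponds to the human, bottom-up way of executing the hierarchy.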

3  KAUS  as  Representation  Language 

A language suited to representing this system is necessary. To cope with
problem models of the form discussed in Section 2, it must be suited to
representing predicates that take data structures as arguments and to
describing meta-level operations, such as knowledge that describes other
knowledge. KAUS (Knowledge Acquisition and Utilization Language) has been
developed for this purpose. In the following, some logical expressions appear
as knowledge. To keep the expressions consistent throughout the whole system,
they should be written in the KAUS language. Here, however, they are not
written as strictly correct KAUS expressions but are locally simplified,
because the KAUS syntax is not included in this volume and because the
simplified expressions are easier to comprehend than the correct ones.

4  Automatic  Programming  for  Unit  Program 

The problem of automatic programming of a unit program is formulated as
follows. A unit activity at the lowest level (a leaf) of the activity hierarchy
is presented. It becomes a query to be solved by a programming system, and the
procedure obtained for solving it is translated into a program. The system
contains a knowledge base and an inference engine to solve problems
automatically. The activity given as a query includes variables with domains,
which represent the input and output variables of the program to be generated.
The programming system is required to solve the query for all possible values
of the variables in their designated domains. This programming process is
composed of two stages: generating a specification-tree that represents the
program specification, and converting this specification-tree to object code.
The specification-tree is represented as shown in Fig. 1.

Every activity is represented in this tree in the form of a logical predicate.
The tree has predicate nodes and rule nodes, and the two kinds of nodes appear
alternately. The type of each variable, either input or output, is recorded in
every predicate node. The domains of variables are recorded in rule nodes. A
domain can be narrowed during problem solving, as will be shown later. More
than one specification-tree is generated, depending on the situation, to
represent different functions, and each specification-tree is transformed into
program source code as a function call. In each tree, the function name and its
process are declared in the top node, and the top node representing a function
is expanded into child nodes that represent its details. Nodes inside the tree
show the logical structure and the input-output relations of processes. Every
leaf node represents a primitive process for which a program has been prepared.
With KAUS, a leaf node is either a fact predicate or a PTA predicate, whose
process is defined in a procedural language. When a specification-tree is
transformed into a program, the nodes inside the tree are ignored and only the
leaf nodes are encoded, in depth-first order, because the inner nodes are
intermediate products generated to obtain the correct set of leaf nodes. A node
whose child nodes are connected by the "or" connective is transformed into a
branch. Once a specification-tree has been generated, it can be transformed
into a program in any target language according to transformation rules. Thus
for automatic generation of programs, the method of generating the
specification-tree is important. It is discussed in detail first.

Fig. 1. Specification-tree
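
The transformation described above can be sketched as follows. This is a
minimal Python illustration, not the KAUS-based implementation: the Node class,
the condition strings, and the emitted call syntax are assumptions made for the
example. Inner nodes contribute no code of their own, leaves are emitted in
depth-first order, and an "or" node becomes an if/elif chain.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str                       # predicate name
    connective: str = "leaf"        # "leaf", "and", or "or"
    children: List["Node"] = field(default_factory=list)
    condition: str = ""             # guard used when the parent is an "or" node


def emit(node: Node, indent: int = 0) -> List[str]:
    """Emit target-code lines for the leaves of the tree, depth first."""
    pad = "    " * indent
    if node.connective == "leaf":
        return [f"{pad}{node.name}()"]
    lines: List[str] = []
    if node.connective == "and":
        # Inner "and" node: emit nothing for the node itself, only its children.
        for child in node.children:
            lines += emit(child, indent)
        return lines
    # Inner "or" node: one branch per alternative child.
    for i, child in enumerate(node.children):
        keyword = "if" if i == 0 else "elif"
        lines.append(f"{pad}{keyword} {child.condition}:")
        lines += emit(child, indent + 1)
    return lines


# Hypothetical tree: p1 and p2 are inner nodes, the other nodes are leaves.
tree = Node("p1", "and", [
    Node("read_x"),
    Node("p2", "or", [
        Node("p3", "and", [Node("negate")], condition="x < 0"),
        Node("keep", condition="x >= 0"),
    ]),
    Node("write_y"),
])
program = emit(tree)
```

The inner predicate names p1, p2, and p3 do not appear in the emitted lines,
which mirrors the remark that inner nodes are only intermediate products.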

4.1 Specification-Tree Generation

The basic idea of automatic programming in this paper is based on automatic
problem solving. Backward reasoning is performed by the inference mechanism of
KAUS. The successful path of problem solving is retained from the trace of the
process by deleting the backtracked paths, and it is represented as a tree of
the rules used along it. In general it is difficult to solve problems that
include variables and to generate the trees directly, because different trees
can be generated depending on the domains of mutually related variables.

Therefore the tree is generated in two steps. First, the system chooses a
specific value for every variable from its domain, solves this instance
problem, and generates an instance tree. The tree may differ depending on which
instances are selected for the variables; each instance tree shows one possible
pattern of the tree.

Then, as the second step, this tree is generalized by restoring a variable in
place of each instance. There are two approaches to generalization. One is to
apply instance problem solving repeatedly, selecting a different set of
instances each time, until no new pattern of the tree is generated. Then, by
merging all the different patterns into one tree, a general structure of the
tree for the query including variables is obtained.

With this method, enough queries must be selected to obtain all the tree
patterns needed to cover every value in the input domain. The authors adopted
this method first. It runs into difficulty, however, when the input domain is
continuous or is, for example, the whole set of integers. Even then the number
of possible patterns can be finite, but there is no guideline for an efficient
selection of the instance sets.
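
The difficulty with the first approach can be seen in a small Python sketch.
The three rules and their guards are hypothetical, and the name of the rule
that fires stands in for the pattern of a whole instance tree; no real
inference engine is involved.

```python
# Hypothetical rules: each pairs a rule name with a guard on the input value.
RULES = [
    ("r_neg", lambda x: x < 0),
    ("r_zero", lambda x: x == 0),
    ("r_pos", lambda x: x > 0),
]


def instance_pattern(x):
    """Solve one instance problem and return the pattern (rule name) it yields."""
    for name, guard in RULES:
        if guard(x):
            return name
    raise ValueError(f"no rule applies to {x}")


def collect_patterns(instances):
    """Merge the patterns obtained from a sequence of instance problems."""
    patterns = set()
    for x in instances:
        patterns.add(instance_pattern(x))
    return patterns


all_patterns = collect_patterns(range(-3, 4))   # well-chosen instances
some_patterns = collect_patterns([5, 7, 9])     # poorly chosen instances
```

Only three patterns exist even though the input domain is all integers, yet the
second selection of instances finds just one of them. Nothing in the procedure
itself indicates which instances still need to be tried.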

The second approach is to use the domain information of the variables after an
instance problem (or a few instance problems, if necessary) has been solved and
an instance tree has been generated. In a KAUS expression, a domain can be
included in every predicate, as in ‘(∀x/domain1)(∀y/domain2) predicate1(x, y)
:- predicate2(x), predicate3’. By using this domain information, it becomes
possible to make an optimized program specification.

The following rule shows how to generate a specification-tree.

makeSpecificationTree(+Subject, +Query, -SpecificationTree) :-
    makeNewInstanceQuery(+Query, -InitialQuery),
    getInstanceTree(+Subject, +InitialQuery, -InstanceTree),
    analyzeInstanceTree(+Query, +InstanceTree, -AnalyzedTree),
    modifyTree(+AnalyzedTree, -ModifiedTree),
    generalizeTree(+Subject, +ModifiedTree, -GeneralTree),
    optimizeTree(+GeneralTree, -OptimizedTree),
    identifyConditionNodes(+OptimizedTree, -SpecificationTree).

generalizeTree(+Subject, +CurrentTree, -GeneralTree) :-
    makeNewQuery(+CurrentTree, -Query),
    makeNewInstanceQuery(+Query, -InstanceQuery),
    getInstanceTree(+Subject, +InstanceQuery, -InstanceTree),
    analyzeInstanceTree(+Query, +InstanceTree, -AnalyzedTree),
    mergeTree(+CurrentTree, +AnalyzedTree, -MergedTree),
    modifyTree(+MergedTree, -ModifiedTree),
    generalizeTree(+Subject, +ModifiedTree, -GeneralTree).

generalizeTree(+_, +GeneralTree, -GeneralTree).

For readability, the form of this rule is modified from the KAUS form. In this
expression, a "+" mark placed before a value means that the value is an
argument, and a "-" mark means that it is a return value.

4.2 An Example of Specification-Tree Generation

The specification-tree generation process is explained here with an example.
The whole process is shown in Fig. 2. Generation of a specification-tree is
started by the predicate makeSpecificationTree. makeNewInstanceQuery selects a
specific value from the domain R0 in the query q0, and a new query,
NewInstanceQuery, is made with the selected value. getInstanceTree is called
with NewInstanceQuery, and the system actually solves the problem. By tracing
the problem-solving process, getInstanceTree makes an InstanceTree. Then
analyzeInstanceTree analyzes the input-output type on the basis of InstanceTree
and the traced domain of each variable in all nodes of InstanceTree. modifyTree
modifies the structure of AnalyzedTree. The resulting tree, ModifiedTree, is
the initial state of the specification-tree.

By generalizing this tree, GeneralTree, which represents a general processing
structure, is obtained. In generalizeTree, makeNewQuery and
makeNewInstanceQuery search the tree being generalized for nodes at which a
not-yet-used rule can be selected, and generate a query that selects such a
rule. For example, the node p1(x) can be expanded by the rule r2 as well as by
r1. Since $x \in R_0$ in p1(x) and $R_0 \cap R_3 \neq \emptyset$, it is
possible to generate a query ‘p1(3)?’ that expands r2. Using this query, an
analyzed tree can be obtained in the same way as the initial tree. Then
mergeTree and modifyTree merge the tree being generalized with this new
analyzed tree. This generalization procedure is repeated until makeNewQuery
generates no further query. In this example, generalization is performed once
more with the query ‘p2(5)?’ and then terminates. By identifying the conditions
of optimization and branching in the tree thus obtained, a specification-tree
(SpecificationTree) is obtained.
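
The domain-driven choice of the next query can be sketched in Python as
follows. This is only an illustration under simplifying assumptions: domains
are finite sets of integers, each rule has a single domain, and the rule names
and domain values are hypothetical rather than taken from Fig. 2. Solving the
instance and merging the resulting tree are omitted.

```python
def next_query(node_domain, rule_domains, used_rules):
    """Return (rule, value) for an unused rule whose domain overlaps the node's."""
    for rule, domain in rule_domains.items():
        if rule in used_rules:
            continue
        overlap = node_domain & domain
        if overlap:
            # Any value in the overlap gives an instance query that selects the rule.
            return rule, min(overlap)
    return None


def generalize(node_domain, rule_domains):
    """Generate instance queries until no unused, applicable rule remains."""
    used, queries = set(), []
    while True:
        query = next_query(node_domain, rule_domains, used)
        if query is None:
            break
        rule, value = query
        used.add(rule)
        queries.append(value)
    return used, queries


# Hypothetical domains: r3 never overlaps the node domain, so it is never used.
R0 = set(range(0, 10))
rule_domains = {
    "r1": set(range(0, 3)),
    "r2": set(range(3, 10)),
    "r3": set(range(20, 30)),
}
used, queries = generalize(R0, rule_domains)
```

The loop stops as soon as no new query can be generated, and a rule whose
domain cannot be reached from the query never enters the result.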

Note that rules that are not used, such as r3, do not appear in the
specification-tree. As shown, tracing the domains of variables makes it
possible to select the necessary and sufficient rules. Accordingly, general
rules can be written without regard to how each will be used in a specific
problem.

4.3  Domain  Tracing 

Domain tracing is one of the most important processes in specification-tree
generation. Tracing and narrowing the domains of input-output values makes it
possible to put the necessary and sufficient rules into the specification-tree.
The domain of an argument is traced from the top of the tree, and the domain of
a return value is traced from the bottom of the tree. Fig. 3 shows an example
in which the query ‘p1(x,y)?’ is given with x as the argument and y as the
return value. The domain of x is traced from the top and the domain of y is
traced from the bottom. In r3, y is a function of x, so the domain of y (R7')
has to be calculated from the domain of x. If all of the primitive predicates
can be calculated, the domain is traced precisely. In many cases, however, this
calculation is not possible. The domain that cannot be calculated is then set
to Univ, meaning the universe. A larger domain causes unnecessary rules to be
selected. In many cases it does not matter even if the domain is set to Univ.
When a narrow domain is needed in a later process in order to select rules
exhaustively, however, the larger domain causes a problem in rule selection.

Fig. 2. Specification-tree generation
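
The effect of falling back to Univ can be illustrated with a short Python
sketch. Domains are modeled as closed numeric intervals, and the output domain
is computed only when a monotonically increasing function is supplied; both
choices, as well as the rule names and bounds, are assumptions made for this
example.

```python
import math

UNIV = (-math.inf, math.inf)   # the universe, used when a domain cannot be computed


def trace_output_domain(input_domain, increasing_fn=None):
    """Compute the domain of y = f(x); fall back to UNIV if f is unavailable."""
    if increasing_fn is None:
        return UNIV
    low, high = input_domain
    return (increasing_fn(low), increasing_fn(high))


def intersect(domain, rule_domain):
    """Intersect two intervals; return None when they do not overlap."""
    low = max(domain[0], rule_domain[0])
    high = min(domain[1], rule_domain[1])
    return (low, high) if low <= high else None


def applicable_rules(domain, rule_domains):
    """List the rules whose domain overlaps the traced domain."""
    return [r for r, d in rule_domains.items() if intersect(domain, d) is not None]


rule_domains = {"small": (0, 20), "large": (100, 200)}    # hypothetical rules on y
traced = trace_output_domain((0, 5), lambda x: 2 * x + 1)  # y = 2x + 1
untraced = trace_output_domain((0, 5))                     # calculation unavailable

with_tracing = applicable_rules(traced, rule_domains)
without_tracing = applicable_rules(untraced, rule_domains)
```

With the traced domain only one rule is selected, whereas with Univ the rule
that can never apply is selected as well.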

In some cases, whether programming is possible at all is revealed during domain
tracing. For example, let a predicate p1, whose argument x has the domain R1,
be expanded, and suppose it turns out that p1 is to be processed either by p2
or by p3, whose argument x has the domain R2 or R3 respectively. If
$R_1 \not\subseteq R_2 \cup R_3$, some inputs for p1 in R1 can be processed
neither by p2 nor by p3, and programming becomes impossible with the existing
rules. If there are such nodes, other than the nodes used for judging
conditions as discussed below, the system has to tell the user that the rules
necessary for programming are insufficient and that the missing knowledge must
be supplied.
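
The coverage condition above amounts to a set difference. The following Python
sketch, with hypothetical finite domains, reports the inputs of the parent
predicate that no child predicate can process.

```python
def uncovered_inputs(parent_domain, child_domains):
    """Return the parent inputs that lie outside every child domain."""
    covered = set().union(*child_domains)
    return parent_domain - covered


R1 = set(range(0, 10))     # hypothetical domain of x in p1
R2 = set(range(0, 4))      # hypothetical domain of x in p2
R3 = set(range(6, 10))     # hypothetical domain of x in p3

gap = uncovered_inputs(R1, [R2, R3])
no_gap = uncovered_inputs(R1, [R2, set(range(4, 10))])
```

A non-empty result corresponds to the case in which the system has to report
that the available rules are insufficient.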

When a specification-tree has a recursive structure that includes a loop, the
domain of a variable passed around the loop changes in every iteration. Since
it is difficult to trace a domain that changes in every cycle, the domain is
set to Univ. Variables that are not changed in the loop are given the domains
they had just before entering the loop. A method of tracing the domain passed
around a loop as far as possible still needs to be investigated, but in many
cases this is not urgent because setting the domain to Univ does not cause a
serious defect.

4.4  Structure  Extraction 

Unlike a procedural program, an activity defined in the form of a predicate
carries no information on the sequence of processes. To obtain the process
sequence, the system is made to solve a problem, and the specification-tree
equivalent to the process the system performed is obtained. Before structure
extraction, a specification-tree generated in this way may have many duplicate
or overlapping sub-trees. These sub-trees are extracted from the main tree as
independent functions. A recursive process or a loop process has a sub-tree of
the same form that appears repeatedly, increasing the depth of the tree with
the number of repetitions. Special treatment is necessary to prevent the tree
from becoming infinitely deep.

Such a sub-tree is extracted and processed separately. This is achieved by
cutting it out as a separate specification-tree; only a call to the extracted
part is left in the original tree. After that, all specification-trees that
have the same top node are merged into a single tree. As a result, no tree
contains sub-trees with the same predicate node, and the number of
specification-trees becomes at most equal to the number of rules.

As an example of sub-structure extraction, a case including a loop is shown in
Fig. 4. In this example a structure is modified by adding an analyzed instance
tree T3 to a tree that is composed of T1 and T2 and is being generalized. The
nodes enclosed in boxes in the figure are nodes that appear more than once as a
result of adding T3. Cutting these nodes from the end gives state (b). State
(c) is the final state, obtained by merging by disjunction the p2 trees that
are selected according to the condition on the argument. When two
specification-trees that have the predicate p2 as their top node are merged,
every domain of the variables in the merged specification-tree is recalculated
and modified.
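
The extraction of duplicated sub-trees can be sketched in Python as follows.
The tuple representation of trees, the "call:" marker, and the example tree are
assumptions made for this illustration; it covers exact duplicates only and
assumes that a predicate name identifies a repeated sub-tree. Loops, domain
recalculation, and merging by disjunction are not modeled.

```python
from collections import Counter


def subtrees(tree):
    """Yield every non-leaf sub-tree; a tree is a tuple (name, child, child, ...)."""
    if len(tree) > 1:
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)


def extract_duplicates(tree):
    """Cut repeated sub-trees out as separate trees, leaving call nodes behind."""
    counts = Counter(subtrees(tree))
    repeated = {t for t, n in counts.items() if n > 1}
    functions = {}

    def rewrite(t):
        if len(t) == 1:
            return t                              # leaf: keep as is
        rebuilt = (t[0],) + tuple(rewrite(c) for c in t[1:])
        if t in repeated:
            functions[t[0]] = rebuilt             # store once under the predicate name
            return ("call:" + t[0],)              # leave only a call in the original
        return rebuilt

    return rewrite(tree), functions


# Hypothetical tree in which the sub-tree rooted at p2 appears twice.
p2 = ("p2", ("a",), ("b",))
tree = ("p1", p2, ("p3", p2, ("c",)))
main_tree, functions = extract_duplicates(tree)
```

After extraction the main tree contains two calls to p2 and the body of p2 is
kept once, so no predicate node is repeated inside a single tree.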

Here there is the problem of defining when specification-trees have the same
node as their top. The condition that they have the same predicate with the
same number of arguments and the same input-output relation is mandatory. In
addition, the generated specification can differ depending on whether equality
of the traced domains of the input-output variables is added to the condition.
When two specification-trees with different input domains are merged, the
generated program code can be shorter but its processing time can be longer,
because the number of conditions to be evaluated increases. This becomes a
problem for a program unit that is used frequently. The problem is resolved by
handling specification-trees with different input domains as different trees
until generalizeTree finishes. That is, until GeneralizedTree has been
obtained, the structure is converted with priority given to processing speed,
and at the optimization stage it is decided whether two specification-trees
with different input domains are to be merged. Since every structure has
already been obtained by the optimization stage, two trees can be merged
without any problem as long as their tree structures are the same, even if they
have different input domains. When the two tree structures differ because of
the difference in input domains, there is no fixed basis for deciding whether
to reduce the program size by merging or to keep the processing speed high by
leaving the trees unmerged. An additional criterion, for example a human
decision, is necessary.

Fig. 4. Program structure extraction

4.5  Branch  Conditions 

Finally, a method of extracting branch conditions is discussed. Branch
conditions are extracted after GeneralizedTree has been generated and
optimization has finished. Although the specification-tree at this stage
includes information on the execution order of program components and on
function call conditions, the branch condition at a disjunctive node is not
described explicitly and must be identified. There are two cases in which the
inference operation selects rules in a way that induces a branch. One is the
case in which different rules are selected by backtracking. The other is the
case in which a rule is selected by matching of variables. With KAUS, rule
selection by matching of domains can occur because KAUS includes domain
information explicitly.

Extracting branch conditions involves identifying the predicates that are
defined to represent conditional operations and generating a decision tree for
rule selection by matching the identified branch conditions against the
variables. These methods are now being investigated. At present such a decision
tree is not built; instead, the condition is attached to every variable. In
addition, a predicate that has been defined as representing a conditional
operation is explicitly marked as forming a branch condition. Since a
conditional operation is presumably written intentionally when it is described
as a procedural specification, it is possible to ask people to describe it
explicitly. A method of detecting conditional operations automatically remains
to be studied in the future.


5  Summary 

A method of translating human activities into a program has been discussed. To
allow human-like representation and to replace the person as the subject of
programming with a computer, a new modeling scheme is introduced. By using the
domains defined for the variables of rules, an optimal program specification
can be generated from the rules in the form of a specification-tree. A system
based on this method is now being developed. We have already developed the
basic part of the system and are now implementing the domain-tracing method.


Abstract. An evolutionary algorithm (EA) approach is used in the
development of a test vector generation application for single and multiple
fault detection of growth faults in Programmable Logic Arrays (PLAs).
Evolutionary algorithms are search and optimization procedures that find their
origin and inspiration in the biological world. In this paper, we apply
genetic operators to the CNF-satisfiability problem for the generation of test
vectors for growth faults. CNF has several advantages: there are no
dependencies between bits, so any change results in a legal (meaningful)
vector (either a minterm or a maxterm). Thus we can apply mutation and
crossover without any need for decoders or repair algorithms. The crossover
operation, unlike previous operators used in PLA test generation, does not use
lookups or backtracking.


1  Introduction 

Recent literature has addressed the problem of PLA test generation [1], [2], [4].
Several algorithms have also been proposed for PLA testing using the sharp (#)
operation or a modified version of it, but they tend to be computationally
expensive [4], [5], [6]. Smith [7] suggests simplifying the algorithm by
generating a test for every fault. This results in a considerably larger set of
test vectors, so a minimal test set is not guaranteed. Other approaches employ
additional hardware, which means greater cost and potential degradation of PLA
performance. In [8] an overhead ranging from 20% to 50% is reported for large
PLAs using various such methods.

However, PLA testing based on genetic and evolutionary algorithms is still at
an early stage of development. An algorithm for shrinkage faults using genetic
algorithms was recently proposed in [3].


Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI  1932,  pp.  186-195,  2000. 
©  Springer-Verlag  Berlin  Heidelberg  2000 


PLAtestGA:  A  CNF-Satisfiability  Problem 




In this paper we present an algorithm for PLA test generation, and its
implementation, that applies genetic operators to the CNF-satisfiability
problem and shows that test pattern generation can be very efficient. This
technique eliminates operations (such as backtracking and the sharp (#)
operation) that can become computationally intractable with increasing PLA
size.


2  Fault  Modeling 

In testing digital circuits, the most commonly considered fault model is the
stuck-at fault (i.e., s-a-0 or s-a-1). However, because of the PLA's array
structure, the stuck-at fault alone cannot adequately model all physical
defects in a PLA [9]. The intersections between product lines and input bit
lines, or between output function lines and product term lines, are called
crosspoints. Each product line is used to realize an implicant (product term)
of the given function by placing appropriate crosspoint devices in what is
known as the AND plane. Therefore, a new fault model class, known as the
crosspoint model, is used. The unintentional presence or absence of a device in
the PLA causes a crosspoint fault.

The  focus  of  this  paper  is  on  the  use  of  genetic  algorithms  for  the  generation  of 
test  vectors  for  growth  faults. 


3  The  Growth  Fault 

Growth faults correspond to the removal of a literal, in the AND plane, from an
implicant (product term) of the function, which causes the implicant to grow. A
growth fault causes the ON-set (i.e., minterms) of a fault-free PLA to grow
into the OFF-set (i.e., maxterms). To detect a fault in a PLA, the PLA output
in the presence of the fault must differ from the PLA output in the absence of
the fault. The two requirements for fault detection are fault sensitization and
fault propagation.

Fig. 2. Example PLA for Growth Faults


A missing device fault in the AND plane is sensitized if and only if the
implicant under test carries a 0 when fault-free and a 1 in the presence of the
fault. Once a fault has been sensitized, a propagation path must be
established; otherwise the fault is masked¹. Propagation is achieved by
deselecting all product lines connected to the output other than the product
term under test.

The procedure for deriving the growth test vectors is explained with the aid of Fig. 1.

The  product  term  is  represented  by  an  AND  gate  of  4  inputs.  A  dash  in  the 
input  lines  indicates  the  absence  of  a  device,  whereas  a  circle  ‘O’  on  the  input  lines 
indicates  the  presence  of  a  device. 

For example, to detect a missing device at x1 of the implicant under
consideration [1X01], a logic 0 must be applied to the input x1, while the care
values (at inputs x3 and x4) remain unchanged. Since the value of the literal
was changed from 1 to 0, a term from this set can detect a fault in the
uncomplemented bit-line. To detect a fault in the complemented bit-line, the
literal must be changed from 0 to 1.

Now  we  should  be  able  to  sensitize  this  fault  (if  one  exists)  at  the  output  of  the 
AND  gate  under  consideration.  A  value  1  at  the  output  of  the  AND  gate  denotes  a 
fault  while  the  implicant  under  test  carries  a  0  in  the  absence  of  a  fault.  To  generate  a 
0  on  the  product  line  (required  for  sensitization  of  growth  faults)  the  input  value 
connected  to  the  target  growth  fault  bit-line  is  toggled  to  the  value  opposite  the  value 
representing  the  used  bit-line. 
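
The sensitization step for the implicant [1X01] can be checked with a short
Python sketch. The string encoding of product terms, the function names, and
the choice of 0 for don't-care inputs are assumptions made for this example;
fault propagation through the other product lines is not modeled.

```python
def matches(term, vector):
    """True if the product term carries a 1 for the given input vector."""
    return all(t in ("X", v) for t, v in zip(term, vector))


def with_growth_fault(term, i):
    """Model a missing device at position i: the literal becomes a don't care."""
    return term[:i] + "X" + term[i + 1:]


def sensitizing_vector(term, i, fill="0"):
    """Toggle literal i, keep the other care values, fill don't cares with `fill`."""
    if term[i] == "X":
        raise ValueError("position i holds no device, so no growth fault exists")
    bits = [fill if t == "X" else t for t in term]
    bits[i] = "1" if term[i] == "0" else "0"
    return "".join(bits)


term = "1X01"                                  # implicant from the text, x1 first
vector = sensitizing_vector(term, 0)           # apply logic 0 to x1
fault_free = matches(term, vector)
faulty = matches(with_growth_fault(term, 0), vector)
```

The fault-free product line carries a 0 for this vector while the faulty one
carries a 1, which is the sensitization condition stated above.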

The small PLA of Fig. 2 is used as a running example to illustrate test
pattern generation for growth faults using genetic algorithms (GA). The
function of Fig. 2 can be expressed either as a truth table or as a sum of
products, as shown below:

$f(x_1, x_2, x_3, x_4) = \sum (0, 1, 2, 4, 5, 6, 9, 13)$


Footnote 1: The necessary condition under which masking occurs in a PLA is given in [10].

Fig. 3. The Growth Term of the PLA


3.1  Growth  Term 

The above discussion leads to the following rules for generating test vectors
for growth faults. A growth term stands for the set of extra terms contributed
by a growth fault.

The  growth  term  from  a  given  set  of  product  terms  is  derived  as  follows: 

PROCEDURE 1:

For each product line mj
Do (n - log2 Ω) times
{
    Construct a growth term as follows:
    Scan the product term from left to right until an unmarked literal is found;
    Mark the literal and toggle its value from 0 to 1, or from 1 to 0;
    Leave the other components of the product term intact (both literal and
    don't-care values). These extra terms correspond to the growth term.
}

(n - log2 Ω) is the number of literals on each product term [3].

The growth term for the sample PLA is derived using Procedure 1 (see Fig. 3).
The fully redundant growth terms are underlined. A growth term is fully redundant
when it is fully covered by one or more input product terms. A growth term may also
be partially redundant, i.e., partially covered by one or more input product terms.
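The toggling step of Procedure 1 can be sketched in a few lines of Python. This is only an illustration, assuming a product term is written as a string over `0`, `1` and `X` (don't care), as in the implicant [1X01] used earlier; the function name `growth_terms` is hypothetical and the redundancy check against the other product terms is not included.

```python
def growth_terms(product_term: str) -> list[str]:
    """Toggle each literal of a product term in turn, left to right.

    Don't-care positions ('X') are not literals, so they are skipped
    and left intact, as Procedure 1 requires.
    """
    flip = {"0": "1", "1": "0"}
    terms = []
    for pos, value in enumerate(product_term):
        if value in flip:
            # Toggle this literal only; keep every other position as is.
            terms.append(product_term[:pos] + flip[value] + product_term[pos + 1:])
    return terms


print(growth_terms("1X01"))  # ['0X01', '1X11', '1X00']
```

The term has three literals, so three growth terms are produced. The first one, 0X01, is the case discussed above in which a logic 0 is applied to x1 while x3 and x4 keep their care values.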

A growth term may have terms in the ON-set function (i.e., minterms) and in the
OFF-set function (i.e., maxterms). The terms in the ON-set function fail to select
uniquely the product line on which the target is located, since they also select the
product terms that cover them. Therefore, the fault cannot be propagated.
Furthermore, since a fault can be sensitized by a term from the ON-set function, it is
necessary to delete those terms from the growth term. This can be carried out by
computing the intersection (denoted by ∩) between the growth term generated for each
product term and the complement function (OFF-set) [4], [6]. One of the disadvantages
of this approach is the backtracking that can occur when a test is chosen and fails
to propagate.

Another approach is to apply the sharp operation (#) between the growth term
generated for each product term and the ON-set function. Bose in [1] and [5] uses
this operation to find terms that are not covered by the ON-set function. However,
terms partially covered by any input product term cannot be eliminated from the
growth term without eliminating the growth term itself. That is, an invalid test
vector may be generated after the Quine-McCluskey method is applied.

3.2  Conjunctive  Normal  Form 

PLAtestGA uses the conjunctive normal form (CNF) logical expression equivalent
to the complement of the function to derive the test set for growth faults. The use
of the CNF is supported by De Morgan's theorem [11]. The terms complement and
OFF-set function are equivalent.

The following logical expression of the example PLA of Fig. 2 is in CNF:

$$f'(x_1, x_2, x_3, x_4) = (x_1' \lor x_3 \lor x_4') \land (x_1 \lor x_3' \lor x_4) \land (x_1 \lor x_2' \lor x_3) \land (x_1 \lor x_2 \lor x_3) \land (x_1 \lor x_3 \lor x_4')$$

We apply a genetic algorithm to the CNF-satisfiability problem to generate test
vectors for growth faults. The problem is to determine whether there exists a
truth assignment for the variables in the expression such that the CNF expression
evaluates to TRUE. For example, the above CNF expression has several truth
assignments (valid candidate test vectors) for which the whole expression evaluates
to TRUE, e.g., any assignment with x3 = TRUE and x4 = TRUE. The CNF
expression of the example PLA is made up of five clauses. This allows us to rank
potential bit-pattern solutions in the range 0 to 5, depending on the number of
clauses that the pattern satisfies. Table 1 shows the fitness of each element. When a
pattern has a fitness of 5, it is a maxterm of the function (a member of the OFF-set).
A growth fault can be detected by this pattern if it intersects one or more terms
from the growth term set. The growth term set is the union (∪) of the growth terms
generated by each product term (refer to Procedure 1). Note that an undetectable
fault cannot be detected by any pattern.
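The clause-counting fitness can be sketched as follows. The clause list below is an assumption: it is a reading of the five-clause CNF of the example PLA, chosen so that it agrees with the fitness values quoted in the text (3 for [0001], 5 for [1110] and [0111]). The names `CLAUSES` and `fitness` are illustrative only.

```python
from itertools import product

# Each clause is a list of (bit index, wanted value) pairs; the clause is
# satisfied when at least one bit of the pattern has the wanted value.
# Assumed reading of the CNF of the example PLA (x1 is bit index 0).
CLAUSES = [
    [(0, 0), (2, 1), (3, 0)],  # x1' OR x3  OR x4'
    [(0, 1), (2, 0), (3, 1)],  # x1  OR x3' OR x4
    [(0, 1), (1, 0), (2, 1)],  # x1  OR x2' OR x3
    [(0, 1), (1, 1), (2, 1)],  # x1  OR x2  OR x3
    [(0, 1), (2, 1), (3, 0)],  # x1  OR x3  OR x4'
]


def fitness(pattern: str) -> int:
    """Number of CNF clauses satisfied by a bit pattern such as '0111'."""
    bits = [int(b) for b in pattern]
    return sum(any(bits[i] == v for i, v in clause) for clause in CLAUSES)


# Patterns that satisfy every clause are the valid candidate test vectors.
valid = ["".join(p) for p in product("01", repeat=4)
         if fitness("".join(p)) == len(CLAUSES)]
print(fitness("0001"), fitness("1110"), fitness("0111"))  # 3 5 5
print(valid)
```

Under this reading, the patterns with fitness 5 are exactly the eight input combinations that are not in the ON-set list (0, 1, 2, 4, 5, 6, 9, 13).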

It is hard to imagine a problem with a better-suited representation: a binary vector
of fixed length, similar to the PLA physical layout, does the job. There are several
other advantages. There are no dependencies between bits: any change results
in a legal (meaningful) vector (either a minterm or a maxterm). Thus we can apply
mutations and crossovers without any need for decoders or repair algorithms. Even
less frequently used genetic operators, such as inversion (reversing the order of
bits in the pattern) or exchange (interchanging two different bits in the pattern),
leave the resulting bit pattern a legitimate possible solution [12], [13].

4  Test  Generation  Using  Genetic  Operators 

The basic genetic algorithm, where P(t) is the population of strings at generation t,
is given below:

procedure genetic algorithm
{
    set time t := 0
    select an initial population P(t)
    while the termination condition is not met, do:
    {
        evaluate fitness of each member of P(t);
        select the fittest members from P(t);
        generate offspring of the fittest pairs (using genetic operators);
        replace the weakest members of P(t) by these offspring;
        set time t := t+1
    }
}

Selection  is  done  on  the  basis  of  relative  fitness  and  it  probabilistically  eliminates 
from  the  population  those  candidate  test  vectors  which  have  relatively  low  fitness. 
Recombination,  which  consists  of  mutation  and  crossover,  imitates  sexual 
reproduction. 

Crossover is performed with crossover probability Pcross between two selected
strings, called parents, by exchanging parts of their genomes (i.e., encodings) to form
two new individuals, called offspring. It is implemented by choosing a random point
between 1 and the string length (l) minus one, i.e., in [1, l - 1], in the selected pair
of parents and exchanging the substring defined by that point (i.e., swapping the tail
portion of the string) to produce new offspring. That is, all the information from one
parent is copied from the start up to the crossover point, and then all the information
from the other parent is copied from the crossover point to the end of the offspring
(chromosome). The new chromosome thus gets the head of one parent's chromosome
combined with the tail of the other. For example, consider strings 1 and 3 of Table 1
from our example initial population. See Fig. 4.

In choosing a random number between 1 and 4, we obtain K = 2 (as indicated
by the separator symbol | ). The resulting crossover yields two new strings that are
part of the new generation. These offspring are 0110 and 0001.

Crossover is both simple and efficient. This operation enables the evolutionary
process to move towards optimal solutions in the search space. The usefulness of
crossover is due to the combination of better-than-average substrings coming from
different individuals [14].
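A single-cut-point crossover of this kind can be sketched as below. The function name and the `cut` and `rng` parameters are illustrative. The two parent strings in the usage line are hypothetical, picked only because a cut at K = 2 turns them into the offspring 0110 and 0001 quoted above; the actual parents are those of Fig. 4.

```python
import random


def single_point_crossover(parent_a, parent_b, cut=None, rng=random):
    """Swap the tails of two equal-length bit strings at one cut point.

    The cut is drawn uniformly from [1, l - 1] unless given explicitly,
    so each offspring always receives material from both parents.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if cut is None:
        cut = rng.randint(1, len(parent_a) - 1)  # inclusive on both ends
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


# Hypothetical parents, cut after the second bit (K = 2).
print(single_point_crossover("0101", "0010", cut=2))  # ('0110', '0001')
```

Because every bit string of the right length is a legal pattern, no repair step is needed after the swap.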

Mutation probabilistically chooses a bit and flips it. Mutation is needed because,
although selection and crossover together search for new solutions, they tend to cause
rapid convergence, and there is a danger of losing potentially useful genetic material,
such as 0s and 1s at particular locations of the candidate test vector under evolution.
The usual interpretation of the bit mutation rate is the following: for each string in
the population and for each bit within the string, generate a random number r between
0 and 1; if r < Pmut, flip the bit. This operator is applied to strings 1, 6, and 8 of
Table 1. For example, string 1 is changed from 0110 to 0111 after mutation.
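The per-bit interpretation of the mutation rate translates directly into code. The following is a minimal sketch; the function name `mutate` and its `rng` parameter are illustrative, and the default rate of 0.1 is the Pmut value used for the example PLA.

```python
import random


def mutate(pattern: str, p_mut: float = 0.1, rng=random) -> str:
    """Flip each bit independently with probability p_mut.

    For every bit a random number r in [0, 1) is drawn; the bit is
    flipped when r < p_mut and copied unchanged otherwise.
    """
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_mut else bit
        for bit in pattern
    )


print(mutate("0110", p_mut=0.0))  # 0110  (no bit can flip)
print(mutate("0110", p_mut=1.0))  # 1001  (every bit flips)
```

With Pmut = 0.1 and four-bit patterns, a string is left unchanged most of the time, which matches the text's observation that only a few strings of the population are affected in a generation.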

The  following  definition  applies  to  the  discussions  that  follow. 

Definition  1:  Hamming  distance 

The  number  of  bit  positions  in  which  two  product  terms  hold  non-don’t  care 
values  that  are  different  is  called  the  Hamming  distance,  dH. 

For example, in Fig. 2 the Hamming distance between m3 and m4 is one, i.e., these
terms differ only in bit position x2.
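Definition 1 can be illustrated with a short function. The cubes used for m3 and m4 below (010X and 000X) are assumptions inferred from the CNF clauses of the example PLA, since Fig. 2 itself is not reproduced here; the function name is illustrative.

```python
def hamming_distance(term_a: str, term_b: str) -> int:
    """Count positions where both terms hold care values that differ.

    Terms are strings over '0', '1' and 'X'; a position with a
    don't care ('X') in either term never contributes to the distance.
    """
    return sum(
        1
        for a, b in zip(term_a, term_b)
        if a != "X" and b != "X" and a != b
    )


# Assumed cubes for m3 and m4: they differ only at x2.
print(hamming_distance("010X", "000X"))  # 1
```

The same function works for comparing a fully specified pattern with a product term, which is how the distance is used in the test generation steps that follow.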

The  following  GA  parameters  are  used  for  testing  growth  faults  of  the  example 
PLA  of  Fig.  2: 

•  Uniform  Crossover  Single  Cut  Point 

•  Number  of  generations  :  Until  a  minimal  test  set  is  found 

•  Size  of  Population  :  8 

•  Crossover  Probability  :  1.0 

•  Mutation  Probability  :  0.1 

PLAtestGA begins, at generation 0, with a population of 8 patterns. In each
generation, the fitness of each individual in the population is calculated as the
number of clauses that the pattern satisfies. A maximum value of 5 means that the
pattern (candidate test vector) satisfies each clause of the CNF expression and
consequently is a valid candidate. For example, the fitness of the pattern [0001] of
Table 1 is 3, while the fitness of the pattern [1110] is 5. The string [1110] in
particular permits the detection of missing devices in product terms that are not
activated, i.e., product terms that are not compatible with the pattern under
consideration.

The patterns {[1110], [0111]} generated in Table 1 at the end of generation 0 are
valid tests. Since the pattern [1110] generated by the genetic operators has a
fitness of 5, it is a valid candidate test. This is necessary to assure propagation
of the fault. The next step is to determine whether there are product terms with a
dH equal to 1 from the pattern (candidate test). The missing-device fault that can be
detected by this pattern is on the complemented bit-line of the first input line of
product term m2.

Fig. 4. The Crossover Operation

Table 1. Generation 0
[Table body not recoverable. Column group: Population (x1 x2 x3 x4). Fitness summary: Sum 35, Average 4.25, Max 5, Min 3.]

The  following  Lemma  is  necessary  to  the  present  discussion. 

Lemma  1.  A  maxterm  generated  with  the  genetic  operators  with  a  dH  equal  to  1 
from  any  product  term  is  qualified  to  detect  a  missing  device  fault  in  that  product. 

The  proof  of  this  Lemma  is  supported  by  Procedure  1  and  the  CNF  used  for  the 
pattern  generation. 

The string [0111] uses Lemma 1 to find product terms that differ from it in exactly
one literal. These product terms are m2, m3, and m5. Therefore, this pattern detects a
growth fault in each of these product terms at the bit where they differ (one Hamming
distance away). The missing-device faults detected by the pattern [0111] (with
fitness 5) with the aid of Lemma 1 are shown in Fig. 3. The faults detected are
circled by a broken line in their respective positions in the PLA.
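The use of Lemma 1 can be sketched as a search for product terms at Hamming distance 1 from a candidate pattern. The five cubes below are assumptions inferred from the CNF clauses of the example PLA (Fig. 2 is not reproduced here), and `detectable_faults` is a hypothetical helper; it assumes the pattern has already reached the maximum fitness.

```python
# Assumed product terms of the example PLA, as cubes over 0/1/X.
PRODUCT_TERMS = {
    "m1": "1X01",
    "m2": "0X10",
    "m3": "010X",
    "m4": "000X",
    "m5": "0X01",
}


def detectable_faults(pattern, product_terms=PRODUCT_TERMS):
    """Map each product term at distance 1 from the pattern to the input
    whose missing device the pattern would expose."""
    faults = {}
    for name, term in product_terms.items():
        # 1-based input positions where the term's care value disagrees.
        differing = [
            i + 1
            for i, (t, p) in enumerate(zip(term, pattern))
            if t != "X" and t != p
        ]
        if len(differing) == 1:
            faults[name] = f"x{differing[0]}"
    return faults


print(detectable_faults("0111"))  # {'m2': 'x4', 'm3': 'x3', 'm5': 'x3'}
print(detectable_faults("1110"))  # {'m2': 'x1'}
```

Under these assumed cubes the result agrees with the text: [0111] reaches m2, m3 and m5, while [1110] reaches only the first input of m2.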

The patterns [0111] and [1110] have a fitness of 5. The fittest members can be
selected more than once. A biased roulette wheel is used as the reproduction operator
(see Fig. 5). To reproduce, we simply spin the weighted roulette wheel eight times;
each pattern occupies a slot whose size is proportional to its fitness. A probability
is assigned to each pattern as follows:

$$p_i = \frac{f_i}{\text{Sum}}$$

where $f_i$ is the fitness of pattern $i$ and Sum is the total fitness of the population.

A cumulative probability is obtained for each pattern by adding up the probabilities
of the preceding population members:

$$c_i = \sum_{k=1}^{i} p_k, \quad i = 1, 2, \ldots, \text{PopulationSize}$$

Fig. 5. Biased Roulette Wheel

In this way the fittest members have proportionally more chances of being
reproduced, and patterns can be selected more than once. Table 1 shows how
many times an individual is reproduced. Once the new population has been
reproduced, strings are paired at random and recombined through crossover. The new
individuals (patterns) enter the new population in place of their parents.
Crossover is applied with a frequency Pcross = 1.0.

After crossover, mutation is applied to the population members with a frequency
Pmut = 0.1. It is interesting to note that after these genetic operators are applied in
each generation, the population average fitness continues to improve until the
population becomes little differentiated and the fitness levels off.
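Fitness-proportional reproduction with cumulative probabilities can be sketched as follows. This is an illustration under the usual roulette-wheel reading of the description above; the function names are hypothetical, and the small populations in the usage lines are made-up values rather than the contents of Table 1.

```python
import random
from bisect import bisect
from itertools import accumulate


def cumulative_probabilities(fitnesses):
    """Running sum of p_i = f_i / Sum over the population."""
    total = sum(fitnesses)
    return list(accumulate(f / total for f in fitnesses))


def spin(cumulative, r):
    """Index of the first slot whose cumulative probability exceeds r."""
    # min() guards against rounding that leaves the last value below 1.
    return min(bisect(cumulative, r), len(cumulative) - 1)


def roulette_select(population, fitnesses, n, rng=random):
    """Spin the wheel n times; fitter members may be picked repeatedly."""
    cumulative = cumulative_probabilities(fitnesses)
    return [population[spin(cumulative, rng.random())] for _ in range(n)]


print(cumulative_probabilities([1, 1, 2]))  # [0.25, 0.5, 1.0]
print(roulette_select(["0001", "1110", "0111"], [3, 5, 5], 8))
```

A member with a larger fitness owns a wider interval of [0, 1), so it is hit by more spins on average; a member with zero fitness owns an empty interval and is never selected.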

The final growth test set for the PLA under consideration was found after the
second generation. The test vectors are: {[0011], [0111], [1100], [1110], [1111]}.


5  Conclusions 

This article describes the application of genetic operators to the CNF-satisfiability
problem for testing growth faults in PLAs. Existing methods tend to be computationally
expensive.

The  drawback  of  these  methods  is  the  backtracking  that  could  occur  when  the  test 
is  chosen  and  fails  to  propagate.  Our  proposed  algorithm  overcomes  this  problem  to 
generate  good  solutions  efficiently.  The  CNF  constraint  satisfaction  problem  has 
several  advantages  over  other  approaches  used  for  testing  PLAs.  It  eliminates  the 
possibility  of  intersecting  a  redundant  growth  term  with  a  candidate  test  vector. 
Therefore,  backtracking  is  not  needed.  Also,  a  minimal  test  set  is  guaranteed. 
Abstract. Constraint satisfaction problems (CSPs) occur widely in artificial
intelligence. In the last twenty years, many algorithms and heuristics have been
developed to solve CSPs. Recently, a constraint-graph based evolutionary algorithm
was proposed to solve CSPs [17]. It was shown that it is advantageous to take into
account knowledge of the constraint network when designing genetic operators. On
the other hand, recent publications indicate that parallel genetic algorithms (PGAs)
with isolated evolving subpopulations (that exchange individuals from time to time)
may offer advantages over sequential approaches [1]. In this paper we examine the
performance gain obtained by using multiple populations, evolving in parallel, of
the constraint-graph based evolutionary algorithm with a migration policy. We show
that a multiple-population approach outperforms a single-population implementation
when applied to the 3-coloring problem.


1  Introduction 

Constraint satisfaction problems (CSPs) occur widely in artificial intelligence.
They involve finding values for problem variables subject to constraints on which
combinations are acceptable. For simplicity we restrict our attention here to
binary CSPs, where the constraints involve two variables. Binary constraints
are binary relations. If a variable i has a domain of potential values Di and a
variable j has a domain of potential values Dj, the constraint on i and j, Rij,
is a subset of the Cartesian product of Di and Dj. If the pair of values a for i
and b for j is acceptable to the constraint Rij between i and j, we call the
values consistent (with respect to Rij). The entity involving the variables, the
domains, and the constraints is called a constraint network. In the last twenty
years, many algorithms and heuristics have been developed to find a solution in a
constraint network [6], [10], [9]. Following this trend from the constraint research
community, some approaches have also been proposed in the evolutionary computation
community to tackle CSPs with success [7], [8], [12], [18]. In particular, [16], [17]
proposed an evolutionary algorithm based on the constraint network to solve
CSPs. Our motivations for presenting a parallel version for CSPs are threefold. To
our knowledge, all published evolutionary algorithms that address constraint
satisfaction problems are sequential approaches, i.e., one population (or two in the
case of co-evolution [14]) evolves by means of genetic operators. However,
recent publications indicate that parallel genetic algorithms (PGAs) with isolated
evolving subpopulations (that exchange individuals from time to time) may
offer advantages over sequential approaches [1]. The contributions of this paper
are:


- comparisons of the performance of the multiple populations evolving in
  parallel with previous sequential strategies.

- for this specific algorithm, an investigation of the influence of various
  parallelization parameters on its performance.

Throughout this work, we will use the term MpGA to describe a genetic
algorithm with multiple populations (population structures) evolving in parallel.
Accordingly, "sequential genetic algorithm" indicates a genetic algorithm with a
single population. This usage is consistent with many previous papers. However,
it is important to note that "parallel" and "sequential" refer to population
structures, not the hardware on which the algorithms are implemented. In particular,
the MpGA could be simulated on a single-processor platform (as any discrete
parallel process can) and the sequential genetic algorithm could be executed on
a multiprocessor platform.


2  Problem  Formulation 

The problem at hand is the constraint satisfaction problem (CSP) defined
in the sense of Mackworth [11], which can be stated briefly as follows: We are
given a set of variables, a domain of possible values for each variable, and a
conjunction of constraints. Each constraint is a relation defined over a subset of
the variables, limiting the combination of values that the variables in this subset
can take. The goal is to find a consistent assignment of values to the variables
so that all the constraints are satisfied simultaneously.

CSPs are, in general, NP-complete and some are NP-hard [4]. Thus, a general
algorithm designed to solve any CSP will necessarily require exponential time
in problem size in the worst case. Resulting from these considerations, three
objectives are used in this work to assess the quality of the solution: a fitness
function that takes into account the connection degree in the constraint network,
a dynamic adaptation of the genetic operators, and a parallel migration between
populations.
2.1 Notions on CSP

A Constraint Satisfaction Problem (CSP) is composed of a set of variables
$V = \{X_1, \ldots, X_n\}$, their related domains $D_1, \ldots, D_n$, and a set $\theta$
containing $\eta$ constraints on these variables. The domain of a variable is a set of
values to which the variable may be instantiated. The domain sizes are
$m_1, \ldots, m_n$ respectively, and we let $m$ denote the maximum of the $m_i$. Each
variable $X_j$ is relevant (in what follows we denote "being relevant for" by
$\triangleright$) to a subset of constraints $C_{j_1}, \ldots, C_{j_k}$, where
$\{j_1, \ldots, j_k\}$ is some subsequence of $\{1, 2, \ldots, \eta\}$. A
constraint which has exactly one relevant variable is called a unary constraint.
Similarly, a binary constraint has exactly two relevant variables. A binary CSP is
associated with a constraint graph, where nodes represent variables and arcs
represent constraints. If two values assigned to variables that share a constraint are
not among the acceptable value-pairs of that constraint, this is an inconsistency
or constraint violation.

Definition 2.1. (Constraint Matrix)

A Constraint Matrix $R$ is an $\eta \times n$ rectangular array such that:

$$R[\alpha, j] = \begin{cases} 1 & \text{if variable } X_j \triangleright C_\alpha \\ 0 & \text{otherwise} \end{cases}$$

Definition 2.2. (Instantiation)

An Instantiation $I$ is a mapping from an $n$-tuple of variables
$(X_1, \ldots, X_n) \rightarrow D_1 \times \ldots \times D_n$, such that it assigns a
value from its domain to each variable in $V$.


Definition 2.3. (Constraint Arity)

We define the Constraint Arity for a constraint $C_\alpha$, $a_\alpha$, as the number of
relevant variables for $C_\alpha$.

Definition 2.4. (Partial Instantiation)

Given $V_p \subset V$, a Partial Instantiation $I_p$ is a mapping from a $j$-tuple of
variables $(X_{p_1}, \ldots, X_{p_j}) \rightarrow D_{p_1} \times \ldots \times D_{p_j}$,
such that it assigns a value from its domain to each variable in $V_p$.

Note: For a given $I_p$ we will talk about satisfaction of $C_\alpha$ iff all of its
relevant variables are instantiated.

A solution to the CSP consists of an instantiation of all the variables which does
not violate any constraint.
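These notions can be made concrete on a small graph-coloring instance. The sketch below is illustrative only: the four-node graph in `EDGES` is a made-up example, each edge stands for one binary "different colors" constraint, variables are numbered from 0, and the function names are hypothetical.

```python
# Hypothetical 3-coloring instance: 4 variables (nodes), 4 constraints (edges).
EDGES = [(0, 1), (1, 2), (2, 3), (0, 2)]
N_VARS = 4


def constraint_matrix(edges, n_vars):
    """Build R with one row per constraint: R[a][j] = 1 iff X_j is
    relevant to constraint C_a, and 0 otherwise."""
    return [[1 if j in edge else 0 for j in range(n_vars)] for edge in edges]


def arity(matrix, alpha):
    """Constraint arity: number of relevant variables of C_alpha."""
    return sum(matrix[alpha])


def is_solution(edges, instantiation):
    """A full instantiation is a solution if no constraint is violated,
    i.e. adjacent nodes always receive different colors."""
    return all(instantiation[u] != instantiation[v] for u, v in edges)


R = constraint_matrix(EDGES, N_VARS)
print(R[0], arity(R, 3))                 # [1, 1, 0, 0] 2
print(is_solution(EDGES, [0, 1, 2, 0]))  # True
print(is_solution(EDGES, [0, 1, 0, 1]))  # False (nodes 0 and 2 clash)
```

In a binary CSP every row of the matrix sums to 2, which is why the arity term is constant in the fitness function of the next section.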


3  Network-Based  Evolutionary  Algorithm 

The algorithm uses a non-binary genetic representation. The initial population is
generated randomly. The variable values are selected from their domains with a
uniform probability distribution. The selection algorithm is biased toward the
better-evaluated individuals.
3.1 Fitness Function

In [15] we propose a fitness function specifically defined for CSP, which we
describe briefly in the next two definitions.

Definition 3.1. (Error-evaluation)

For a binary CSP with a constraint matrix $R$, an instantiation $I$, and a binary
non-satisfied constraint $C_\alpha$ which has $X_k$ and $X_i$ as relevant variables
(exactly these two), we define the Error-evaluation $e(C_\alpha, I)$ by:

$$e(C_\alpha, I) = a_\alpha + (\text{Propagation Effect of } X_k \text{ and } X_i)$$

where the Propagation Effect of $X_k$ and $X_i$ in a binary constraint network is
defined as the number of constraints $C_\beta$, $(\beta = 1, \ldots, \eta)$,
$\beta \neq \alpha$, that have either $X_k$ or $X_i$ as relevant variables.

Remark 3.1. If $C_\alpha$ is satisfied then $e(C_\alpha, I)$ is equal to zero.

The fitness function is the sum of the Error-evaluations (Definition 3.1) of all
constraints in the CSP, that is:

Definition 3.2. (Fitness Function)

For a binary CSP with constraint matrix $R$, an instantiation $I$, and
Error-evaluation $e(C_\alpha, I)$ for each constraint $C_\alpha$
$(\alpha = 1, \ldots, \eta)$, the Fitness Function $Z(I)$ is:

$$Z(I) = \sum_{\alpha=1}^{\eta} e(C_\alpha, I) \qquad (1)$$

The goal of the search is to minimize $Z(I)$, which equals zero when all
constraints are satisfied.
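Definitions 3.1 and 3.2 can be sketched for graph coloring as follows. This is a minimal illustration, not the authors' implementation: constraints are given as edges between variables numbered from 0, a constraint is violated when both ends have the same color, and the graph and function names are hypothetical.

```python
def error_evaluation(edges, alpha, instantiation):
    """e(C_alpha, I): zero if C_alpha is satisfied; otherwise its arity
    plus the number of other constraints sharing one of its variables."""
    u, v = edges[alpha]
    if instantiation[u] != instantiation[v]:
        return 0  # Remark 3.1: a satisfied constraint contributes nothing
    arity = 2  # binary constraint
    propagation = sum(
        1
        for beta, (a, b) in enumerate(edges)
        if beta != alpha and {a, b} & {u, v}
    )
    return arity + propagation


def fitness(edges, instantiation):
    """Z(I): sum of the error evaluations over all constraints."""
    return sum(
        error_evaluation(edges, alpha, instantiation)
        for alpha in range(len(edges))
    )


# Hypothetical 4-node graph; colors are 0, 1, 2.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 2)]
print(fitness(EDGES, [0, 1, 2, 0]))  # 0  (a solution)
print(fitness(EDGES, [0, 1, 0, 1]))  # 5  (edge (0, 2) violated, 3 neighbors)
print(fitness(EDGES, [0, 0, 1, 2]))  # 4  (edge (0, 1) violated, 2 neighbors)
```

The two violated instantiations each break a single constraint, yet they receive different scores: a violation on a more connected arc is penalized more, which is how the fitness takes the network structure into account.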

3.2  Operator:  Constraint  Dynamic  Adapting  Crossover 

Constraint Dynamic Adapting Crossover is based on the idea that there are no fixed
crossover points. It performs a crossover between two randomly selected
individuals to create a new one. The child inherits its variable values using a
greedy procedure, which analyzes each constraint (arc) according to a dynamic
priority. The constraint dynamic priority takes into account not only the
network structure but also the current values of the parents. The priority is
constructed using the following procedure. First, it identifies the number of
violations, nv, that is, how many of the two selected parents violate the
current constraint. Second, it classifies the constraints into one of the following
three categories: 0, 1, or 2 violations. Finally, within each category
(0, 1, or 2 violations), the constraints are ordered according to their
contribution to the fitness function. To perform crossover the operator uses two
partial fitness functions. The first one is the partial crossover fitness function,
cfF, which guides the selection of a combination of variable values
by constraint. The second one is the partial mutation fitness function, mfF, for
choosing a new variable value. The whole process is introduced in detail in
[17].
4 Model and Migration Policy

The Migration Model used in this work is shown in Fig. 1. The model is made
up of a master node and i nodes. Each node has a population that evolves
independently using the algorithm described in the previous section. The master's
role is to send the initial parameters to each node (population size, seed, mutation
and crossover probabilities). Once the nodes receive the parameters, each node
is ready to begin its evolution.


4.1  Migration  Policy 

We define a parameter called "migration rate", which specifies the number of
iterations required before sending the best individuals to the neighboring node. The
model also accepts another parameter: the number of individuals migrating from
each node. In Fig. 1 the dotted lines show the migration policy, that is:

- each node sends to its neighboring node the best individuals it has found so far

- each node receives the best individuals from its neighboring node, incorporating
  them into its population


Fig. 1. Migration Model

Therefore our model needs the following parameters:

- i: number of nodes

- Mr: Migration Rate

- MtE: Members to Exchange

- Popsize: population size of each node
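One migration step can be sketched as below. Several details are assumptions, since the text does not fix them: the nodes are arranged in a ring so that node k sends to node k + 1, the incoming individuals replace the receiver's worst members so that Popsize stays constant, and a lower fitness value is better (as with Z(I)). The function name `migrate` is hypothetical, and in the full model it would be called once every Mr generations.

```python
def migrate(populations, fitness, mte=1):
    """Exchange the mte best individuals between neighboring nodes.

    populations: one list of individuals per node.
    fitness: function to minimize; smaller values are better.
    Returns new populations of unchanged size.
    """
    n_nodes = len(populations)
    # Every node first picks its emigrants from its current population.
    emigrants = [sorted(pop, key=fitness)[:mte] for pop in populations]
    result = []
    for k, pop in enumerate(populations):
        incoming = emigrants[(k - 1) % n_nodes]  # from the previous node
        # Assumed policy: drop the worst members to make room.
        survivors = sorted(pop, key=fitness)[: len(pop) - len(incoming)]
        result.append(survivors + incoming)
    return result


# Toy run with i = 3 nodes; individuals are plain numbers whose value
# doubles as their fitness, purely for illustration.
nodes = [[5, 3, 9], [1, 8, 7], [4, 6, 2]]
print(migrate(nodes, fitness=lambda z: z, mte=1))
# [[3, 5, 2], [1, 7, 3], [2, 4, 1]]
```

Selecting all emigrants before any population is modified keeps the step symmetric, which mirrors nodes that exchange messages at the same iteration.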
5 Tests

The aim of the experiments was to investigate the effect of incorporating multiple
populations into the constraint-graph based evolutionary algorithm, and to
compare it with the sequential approach. The algorithm has been tested by running
experiments on randomly generated 3-coloring graphs, subject to the constraint
that adjacent nodes must be colored differently. We used Joe Culberson's
library [5] to generate the random graphs. We have tested the algorithms on
3-coloring problems that have a solution, with a connectivity in the range [4.1..5.9].
For each connectivity we generated 10000 random 3-coloring graph problems. In
order to discard the "easy problems" we applied DSATUR [3] to solve them, and
selected the problems not solved by DSATUR. DSATUR is one of the best algorithms
for solving this kind of problem. The number of problems selected was 300 for each
connectivity. It is important to remark that it is easy to find problems not solved
by DSATUR in the hard zone [4]; this is not the case for other connectivities.


5.1  Hardware 

The hardware platform for the experiments was a PC Pentium III-500 MHz with
128 MB RAM under Linux. For parallel support we used PVM [2]. PVM
allows any computer to be used as a virtual parallel machine with a message-passing
model. The algorithm was implemented in C using the PVM libraries.


5.2  Results 

The single-population algorithm and the multiple-population algorithm use the
same parameters found for the algorithm introduced in [16], that is:

- Mutation probability = 0.2

- Crossover probability = 0.9

The parameters of the migration model are (i: 3, Mr: 50, MtE: 1, Popsize: 20).
Figures 2 and 3 show the results obtained. The multiple-population algorithm
was able to solve more than 88% of the problems selected, even in the
hard zone. Thus it works better than the sequential approach. The number of
generations required was also reduced using the parallel approach, as shown
in Fig. 3.

6  Discussion  and  Further  Issues 

Fig. 2. Solved Problems by Multiple populations and single population (horizontal axis: Connectivity)

Fig. 3. Number of Generations by Multiple populations and single population (horizontal axis: Connectivity)

We have obtained better results applying a model with multiple populations
using the same sequential algorithm. Nevertheless, to be exact in the
interpretation of the results we must consider that our new model works with
three populations instead of the single one of the sequential model. For that
reason, we define a measure of efficiency as:

$$\eta = \frac{\text{Generations serial model}}{\text{Generations parallel model} \times i} \qquad (2)$$

Thus, we can conclude that the efficiency of the parallel model is approximately
60% better than the sequential one. This suggests that it could be advisable to
explore the behavior of the algorithm running on a parallel hardware platform.
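The efficiency measure of equation (2) is a one-line computation. The generation counts in the usage line are hypothetical values, not measurements from the experiments; they are chosen only to show what a ratio of 1.6 looks like with i = 3 nodes.

```python
def efficiency(generations_serial, generations_parallel, nodes):
    """Generations of the serial model divided by the total generations
    spent by the parallel model across all of its nodes."""
    return generations_serial / (generations_parallel * nodes)


# Hypothetical counts: 480 serial generations against 100 generations
# on each of 3 nodes.
print(efficiency(480, 100, 3))  # 1.6
```

A value above 1 means the multiple-population model needs fewer generations in total than the single population, even after its generations are multiplied by the number of nodes.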

7  Conclusion 

A model based on multiple populations improves the performance of the graph-based
evolutionary algorithm that solves CSPs. Our research allows us to conclude
that using an evolutionary algorithm with a migration policy we are able to solve
around 85% of the problems that are in the hard zone. The results suggest that
our technique is a good option for solving CSPs.

There is a variety of ways in which the techniques presented here can be
extended. The principal advantage of our method is that it is general, i.e., the
approach is not tied to a particular problem. Our research is now directed
towards selecting parameters and testing on other hardware platforms.

Abstract. This paper compares a behaviour based architecture and a
plan based architecture for agents in multi-agent systems with respect
to the issue of robustness. The type of robustness investigated is stability
of the systems as two aspects (unpredictability, and rate of perception
compared to speed of the environment) are modified. The comparison is
done using a simulation scenario which was designed to be constrained,
but to capture important qualities of the real world. The scenario was also
chosen to have characteristics that could be favourable to both behaviour
based and plan based paradigms. An analysis of the data collected for
the two approaches provides strongly suggestive evidence that the plan-based
system is more robust than the behaviour based one in some ways.
There is no indication of better robustness in the behaviour based system
with respect to the aspects of robustness investigated.


1  Introduction 

Two  fairly  well  established  architectural  paradigms  in  agent  systems  are  the 
behaviour based paradigm and the plan-based paradigm. Both paradigms recognise
the need for and incorporate reactivity as a fundamental quality of agent
systems  in  dynamic  environments,  although  they  approach  this  in  different  ways. 
There  are  often  claims  from  behaviour  based  proponents  that  behaviour  based 
systems  are  more  robust  (e.g.  [JF92,Bro86]),  while  plan-based  proponents  often 
argue  that  plans  are  necessary  for  addressing  complex  applications. 

This  work  explores  specifically  a  particular  aspect  of  robustness,  namely  the 
stability  of  the  agent  or  program  behaviour  as  characteristics  of  the  environment 
change.  The  particular  environmental  characteristics  that  we  have  manipulated 
are  the  predictability  of  the  world  and  the  speed  of  change  in  the  world  relative  to 
the ability of the agent to perceive and act on those changes. Although the results
are not entirely conclusive,¹ the tendency is that plan-based systems appear to
be  more  stable  under  environmental  change  than  behaviour  based  systems. 

¹ This was due to a lack of sufficient data and to hardware problems which made it
impossible to collect further data.

The behaviour-based approach relies on low level parallel behaviours which
react to sensed information about the current world situation without any
modelling of or reasoning about either the world or the agent’s actions in the world.
Intelligent behaviour is seen to emerge from the combination of simple behaviours
within a complex environment. There are several forms of behaviour based
architectures, of which Brooks’ subsumption architecture [Bro86] is the best
known. Activation nets (e.g. [Mae95]) are another approach based on low level
parallel behaviours.

BDI (belief, desire, intention) architectures are plan based systems which rely
on  a  library  of  outline  plans  which  indicate  how  to  achieve  particular  goals  in 
various  situations.  At  execution  time  the  details  of  the  plan  are  filled  out  by  the 
agent  using  sub-plans  based  on  the  actual  state  of  the  world  when  it  is  time  to 
execute  the  relevant  sub-plan.  This  approach  makes  practical  reasoning  tractable 
as  the  scope  of  deliberation  is  limited  to  choosing  between  competing  matching 
plans  [BIP88].  It  also  allows  the  agent  to  be  reactive  to  environmental  change  by 
suspending  or  aborting  execution  of  a  plan  in  favour  of  a  more  relevant  sub-plan 
or  a  plan  reacting  to  a  more  important  event.  These  systems  are  often  referred  to 
as reactive planners [Fir87,AC90]. Reaction is both to external events conveyed
via sensors and to events based on agent modelling of the world and the expected
results  of  the  agent’s  own  actions. 
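
The plan-selection step described above can be illustrated with a minimal Python sketch. It is not taken from dMars or PRS: the names `Plan`, `applicable_plans` and `select_plan`, and the use of a numeric `priority` to choose between competing plans, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Beliefs = Dict[str, object]


@dataclass
class Plan:
    name: str
    goal: str                            # goal or event the plan responds to
    context: Callable[[Beliefs], bool]   # situation in which the plan applies
    priority: int = 0                    # hypothetical tie-breaker, not from the paper


def applicable_plans(library: List[Plan], goal: str, beliefs: Beliefs) -> List[Plan]:
    """Plans that match the goal and whose context holds in the current beliefs."""
    return [p for p in library if p.goal == goal and p.context(beliefs)]


def select_plan(library: List[Plan], goal: str, beliefs: Beliefs) -> Optional[Plan]:
    """Deliberation is limited to choosing among the matching plans."""
    candidates = applicable_plans(library, goal, beliefs)
    return max(candidates, key=lambda p: p.priority, default=None)


# Illustrative plan library for a herding agent (contents are hypothetical).
library = [
    Plan("drive_flock", "herd", lambda b: not b["stray_sheep"], priority=1),
    Plan("fetch_stray", "herd", lambda b: bool(b["stray_sheep"]), priority=2),
]
```

Re-running `select_plan` whenever the beliefs change is one simple way to picture how a running plan could be abandoned in favour of a more relevant one.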

There  has  been  a  limited  amount  of  work  attempting  to  scientifically  compare 
the  applicability  of  these  two  paradigms  to  particular  problem  types.  Drogoul 
[Dro95]  has  done  some  work  in  this  direction,  attempting  to  apply  a  behaviour 
based approach to chess and also looking at certain situations in Pengi.² In
general Pengi is a game which is so dynamic and unpredictable that little or no
planning is possible [AC87]. However, Drogoul identified some situations where
it would appear that planning could be advantageous. Drogoul’s findings were
that although a modicum of success was achieved in behaviour based chess it was
insufficient to allow successful competition against a deliberative chess program
such as GNU chess. However, in the Pengi situation it was possible to simply
provide agents with more complicated behaviours in order to address the
situations that it seemed would benefit from planning. From this work it appears
that plan based systems are better for achieving intelligent behaviour in
situations where strategy over time is critical. However, for applications which may
be  complex  in  terms  of  multiple  competing  goals  and  the  need  for  reactivity, 
but  do  not  allow  for  strategy  because  of  the  totally  unpredictable  nature  of  the 
environment,  intelligent  behaviour  can  be  successfully  obtained  using  a  purely 
behaviour  based  approach. 

Although  robustness  is  one  of  the  often  claimed  advantages  of  behaviour 
based  systems,  to  our  knowledge  there  has  not  been  any  work  that  attempts 
to  define  some  aspect  of  robustness  and  actually  test  this  claim.  Robustness  is 
essentially  the  ability  of  the  system  to  continue  functioning  well  over  a  wide 

² Pengi is based on the arcade game Pengo where penguins must try to collect as
many diamonds as possible while avoiding being crushed by moving ice blocks
or stung by roving bees.
range of environmental conditions. In this work we have considered two ways
in which environmental conditions may change. The first is that the speed at
which things happen in the agent’s world may change relative to the speed of the
agent itself (or its ability to perceive the change). One example of this is that
a robot in Robocup³ with a camera that perceives a fixed rate of frames per
second  may  well  perform  differently  depending  on  the  speed  of  the  other  agents 
in  the  environment.  If  an  agent’s  environment  is  made  up  of  other  systems  then 
an  upgrade  of  some  of  those  systems  may  result  in  an  effective  speed  up  of  the 
environment.  Similarly  an  upgrade  in  the  agent  system  hardware  may  result  in 
the  environment  being  perceived  to  be  slower,  relative  to  the  speed  of  the  agent 
system. 

Another  way  in  which  we  consider  the  environment  changing  is  in  its  level 
of  predictability.  When  a  system  is  built  it  is  often  tailored  in  some  way  to  the 
patterns  of  the  environment.  As  times  change  these  environmental  patterns  may 
change, resulting in the environment being less predictable than it was
originally. Ideally one would like the system to learn new patterns, but an ability to
degrade  gracefully  would  also  be  useful.  An  argument  is  often  made  that  because 
there  is  not  the  same  level  of  explicit  tailoring  in  behaviour  based  systems  as 
there  is  in  plan-based  systems,  they  should  be  able  to  better  handle  this  kind  of 
environmental  change. 

In this paper we describe a simulation scenario where we modify two
parameters of the scenario representing these aspects and measure the changes in the
performance of the two types of agent systems. Because we are interested in
robustness, i.e. stability under a wide range of conditions, we are not interested so
much  in  whether  one  system  is  better  than  the  other  in  any  particular  situation, 
but  rather  whether  their  rates  of  change  (as  the  environment  changes)  differ  in 
any  significant  way. 


2  Description  of  project 

We  wished  to  establish  a  scenario  that  seemed  suitable  for  either  approach, 
and  where  parameters  could  be  modified  to  simulate  changing  environmental 
conditions.  We  also  wanted  a  scenario  which  could  be  viewed  graphically  to 
allow  for  the  possibility  of  qualitative  evaluation  as  well  as  quantitative. 

The  scenario  developed  was  one  where  two  sheep-dog  agents  needed  to  herd 
a  flock  of  sheep  through  three  different  gates.  The  sheep  mostly  behaved  in  a 
predictable  flocking  manner,  but  would  occasionally  exhibit  random  breakaway 
behaviour  at  a  frequency  that  was  parameterised.  Relative  success  of  the  agents 
was  measured  as  time  taken  to  achieve  the  goal  of  herding  the  sheep  through  all 
three  gates.  Time  was  measured  in  terms  of  world  cycles. 

We  modelled  levels  of  unpredictability  by  modifying  how  often  a  sheep  would 
break  away  from  the  flock  and  move  in  a  random  direction,  rather  than  with 
the  flock.  We  modelled  relative  speed  of  the  environment  by  having  the  dogs  (or 

³ Robocup is an international forum where robots compete at playing soccer.
sheep) move less often than every world cycle. For example, if the sheep moved only
every second world cycle, then the world would be changing less rapidly from
the point of view of the dogs. If the dogs received input and had a move only
every second world cycle, then the environment would be changing more rapidly
from the dogs’ point of view.
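
As a rough illustration of this scheme, the following Python sketch counts how often each kind of agent gets to move when it acts only every n-th world cycle. The function names and the use of a simple modulo test are assumptions; the source does not say how the PAC simulation scheduled moves internally.

```python
def acts_on_cycle(cycle: int, period: int) -> bool:
    """True if an agent that moves once every `period` world cycles moves now."""
    return cycle % period == 0


def count_moves(total_cycles: int, dog_period: int, sheep_period: int):
    """Return (dog moves, sheep moves) over `total_cycles` world cycles."""
    dog_moves = sum(acts_on_cycle(c, dog_period) for c in range(total_cycles))
    sheep_moves = sum(acts_on_cycle(c, sheep_period) for c in range(total_cycles))
    return dog_moves, sheep_moves


# Sheep move every second cycle: the world is slower from the dogs' viewpoint,
# i.e. a dog-to-sheep action ratio of 2:1.
slow_world = count_moves(12, dog_period=1, sheep_period=2)

# Dogs move every second cycle: the world is faster from the dogs' viewpoint,
# i.e. a dog-to-sheep action ratio of 1:2.
fast_world = count_moves(12, dog_period=2, sheep_period=1)
```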

The base platform used for the simulation was the PAC system [PT97]. PAC
attempts  to  provide  an  environment  where  scenarios  can  quickly  and  easily  be 
built  up,  varying  aspects  of  agent  personality  (or  emotions)  and  agent  cognition 
(plans  and  beliefs).  The  purpose  of  the  PAC  system  is  to  allow  experimentation 
with  different  combinations  of  agents  in  different  worlds. 

PAC consists of four modules: cognitive, emotional, behavioural and system
management. The cognitive module manages plan-based agent behaviour using
dMars, a descendant of PRS [GI89], and was used for implementing the
plan-based dog agents. The behavioural module manages the graphical models for
each agent plus their behavioural functions (e.g. walk forward, turn, etc.). The
system management module integrates the other modules, manages scenario
simulation and connects to the user interface for the interactive system. The
behaviour based dog agents were implemented in C++ as a separate piece of
code that was integrated with the system management module. The emotional
module was not used in this work.

The  sheep  in  the  environment  were  implemented  as  simple  agents  in  C++ 
which  used  a  flocking  algorithm  [Rey87]  to  move.  While  this  was  visually  not 
quite  realistic  for  sheep  flocking  it  provided  appropriate  predictable  behaviour. 
Under  normal  circumstances  sheep  action  was  determined  by  a  blending  of 
four  basic  behaviours:  obstacle  avoidance,  velocity  matching,  flock  centering  and 
avoidance of the dogs. At each world cycle a movement vector was calculated
for each behaviour and then multiplied by priorities of 16, 4, 9 and 14
respectively.⁴ The resulting vectors were then summed to provide a final movement
vector. When breakaways were operational a sheep would make a totally random
move some number of times every thousand world cycles at randomly determined
intervals.
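
The blending step can be sketched in a few lines of Python. Only the four priorities and their order come from the text; the 2-D vector type, the function names, and the choice of a uniformly random vector for a breakaway move are assumptions for illustration, not a description of the original C++ code.

```python
import random
from typing import Sequence, Tuple

Vector = Tuple[float, float]

# Priorities from the text, in the order: obstacle avoidance, velocity
# matching, flock centering, avoidance of the dogs.
PRIORITIES = (16, 4, 9, 14)


def blend(vectors: Sequence[Vector], priorities: Sequence[float] = PRIORITIES) -> Vector:
    """Weight each behaviour's movement vector by its priority and sum them."""
    x = sum(w * v[0] for w, v in zip(priorities, vectors))
    y = sum(w * v[1] for w, v in zip(priorities, vectors))
    return (x, y)


def sheep_move(vectors: Sequence[Vector], breakaway: bool, rng: random.Random) -> Vector:
    """Normal flocking move, or a random move when a breakaway is triggered.

    The range of the random move is an assumption; the paper only says the
    move is totally random.
    """
    if breakaway:
        return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    return blend(vectors)
```

For example, with behaviour vectors (1, 0), (0, 1), (1, 1) and (-1, 0), the blended move is (16 + 9 - 14, 4 + 9) = (11, 13).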

The  plan  based  agents  had  plans  to  look  for  gates  and  for  sheep,  to  move 
towards  the  flock  and  to  return  wayward  sheep  to  the  flock.  They  also  had  plans 
that  enabled  them  to  co-ordinate  their  behaviour  to  position  themselves  behind 
the  sheep  and  then  move  towards  the  gate.  The  behaviour  based  agents  had 
a  number  of  behaviours  such  as  moving  towards  the  target,  avoiding  obstacles 
and  moving  away  from  the  other  dog  if  too  close,  which  were  then  combined 
using  behaviour  blending  in  a  similar  way  to  the  sheep.  Some  of  the  behaviours 
were  disabled  if  other  behaviours  matched  strongly.  Both  agent  programs  were 
developed  to  a  point  where  they  appeared  to  be  herding  the  sheep  successfully. 

In  the  following  sections  we  describe  the  specific  experiments  that  were  run 
and  the  results  obtained. 


⁴ These priorities were determined by trial and error.

3  Experiments


One  set  of  experiments  varied  the  unpredictability  in  the  environment  by  running 
scenarios  with  four  different  rates  of  breakaway  of  sheep.  The  rates  used  were  0, 
4,  20  and  50  breakaways  per  1000  world  cycles. 

The  other  set  of  experiments  varied  the  rate  at  which  the  dogs  could  sense 
and  act  compared  with  the  rate  at  which  the  sheep  were  moving.  The  ratios 
of  dog  action  to  sheep  action  used  were  1:4;  1:3;  1:2;  1:1;  2:1;  3:1;  4:1.  These 
experiments  were  run  with  a  predictable  environment  (no  breakaways)  and  with 
a  moderately  unpredictable  environment  (20  breakaways  per  1000  cycles).  Due 
to  machine  problems  we  were  unable  to  obtain  data  for  the  behaviour  based 
agents  at  ratios  1:4,  1:3  and  3:1. 

For  each  set  of  parameters  we  ran  thirty  scenarios.  Each  scenario  started 
with  a  random  movement  by  the  first  sheep,  which  ensured  that  each  scenario 
was  unique.  The  dogs  could  herd  the  sheep  through  the  gates  sequentially  in 
any  order.  When  all  sheep  had  been  herded  through  a  particular  gate,  the  gate 
would  shut  and  the  number  of  moves  required  by  the  dogs  to  herd  the  sheep 
through  that  gate  was  recorded.  When  all  three  gates  had  shut  the  scenario  was 
concluded.  We  recorded  three  basic  pieces  of  data  for  each  scenario: 

-  Total  World  Cycles  (TWC):  the  number  of  world  cycles  taken  for  the 
scenario  from  start  to  finish. 

-  Goal World Cycles (GWC): the number of world cycles where the dogs
were actively driving the sheep towards a gate, as opposed to rounding up
breakaway  sheep. 

-  Breakaway  World  Cycles  (BWC):  the  number  of  world  cycles  where  one 
or  more  sheep  were  separated  from  the  flock. 

We  then  used  this  data  to  calculate  what  we  refer  to  as  an  average  performance 
measure  and  an  average  efficiency  measure. 

The  average  performance  measure  was  calculated  simply  as  the  number  of 
total  world  cycles  averaged  over  the  30  runs.  This  gave  an  average  “time”  taken 
by  the  agent  to  achieve  its  task  -  the  less  time  taken  the  better  the  performance. 

The  average  efficiency  measure  was  calculated  using  the  formula 

$$\frac{100 \times \mathrm{GWC}}{\mathrm{BWC} + \mathrm{GWC}}$$

to  obtain  a  score  for  each  run.  These  scores  were  then  averaged  over  the  thirty 
runs.  This  was  intended  to  capture  how  well  the  dogs  were  maintaining  control 
over  the  sheep.  The  higher  this  number  the  better  the  dogs  were  effectively 
maintaining  control  over  their  world. 
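
The two measures can be written out as a short Python sketch. The formulas follow the definitions above; the function names are hypothetical, and the sketch assumes every run has at least one goal or breakaway cycle so that the denominator is non-zero.

```python
from statistics import mean
from typing import Iterable, Sequence, Tuple


def efficiency(gwc: int, bwc: int) -> float:
    """Per-run efficiency score: 100 * GWC / (BWC + GWC)."""
    return 100.0 * gwc / (bwc + gwc)


def average_performance(twc_per_run: Sequence[int]) -> float:
    """Average number of total world cycles (TWC); lower is better."""
    return mean(twc_per_run)


def average_efficiency(runs: Iterable[Tuple[int, int]]) -> float:
    """Average of the per-run scores; each run is a (GWC, BWC) pair."""
    return mean(efficiency(gwc, bwc) for gwc, bwc in runs)
```

Note that the score is computed per run and then averaged, which is not the same as applying the formula to the summed cycle counts of all thirty runs.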

As  well  as  collecting  statistics  we  recorded  observations  for  a  number  of 
randomly  selected  test  runs.  When  these  runs  were  short  they  were  observed 
from  start  to  finish.  Longer  runs  (some  over  six  hours  duration)  were  observed 
at  random  intervals. 

4  Results

In  observing  the  test  runs  the  plan-based  dogs  appeared  to  have  a  controlling 
behaviour,  while  the  behaviour  based  dogs  appeared  opportunistic,  reacting  to 
opportunities  that  arose,  rather  than  actively  using  a  strategy  to  direct  the 
sheep.  As  the  breakaway  frequency  increased  both  sets  of  agents  appeared  to 
lose  control  of  the  sheep. 

We show our analysis of the statistical data in terms of the two basic
questions we are addressing: How adaptable are the two different agent types to an
increasingly  unpredictable  world?  How  adaptable  are  the  agents  to  changes  in 
speed  of  the  environment  relative  to  their  own  speed? 


4.1  Adaptability  to  unpredictability 

As  the  unpredictability  of  the  environment  increased  the  performance  of  both 
plan-based  and  behaviour  based  agents  decreased  exponentially.  Figure  1  shows 
the  increasing  number  of  world  cycles  taken  to  complete  the  task  for  both  types 
of agents as the number of breakaways increases. This figure also shows the
performance of the control group of sheep without any dogs herding them. Clearly
both  experimental  agent  types  are  achieving  better  results  than  the  sheep  in  the 
control  group  with  no  dog  agents  (p  <  0.005).  Both  agent  types  appear  to  have 
almost  identical  deterioration  patterns  under  decreasing  predictability. 


[Figure: Average World Cycles Required To Complete Task (logarithmic scale).
Series: Plan Based Dogs; Behavior Based Dogs; Sheep Only (No Dogs).]

Fig. 1. Performance under varying unpredictability.


Looking at the initial part of the curve (0-4 breakaways per 1000 world
cycles) it seems that both agent types are also deteriorating less quickly
than the control group, as seen in the slope of the line. However, this is less
interesting than the fact that the two agent types appear to be following a
very similar pattern.

Using  the  measure  of  efficiency,  rather  than  performance  (i.e.  what  percentage 
of  agent  time  was  spent  driving  the  sheep  towards  the  gate,  as  opposed  to  doing 
the  subsidiary  task  of  managing  the  unpredictable  breakaways),  we  see  a  linear, 
rather  than  exponential  decline  as  the  world  becomes  more  unpredictable.  This  is 
shown  in  figure  2.  In  this  case  at  both  a  breakaway  rate  of  4:1000  and  at  20:1000 
the  behaviour  based  dogs  are  significantly  less  efficient  than  the  plan-based  dogs 
(0.05 > p > 0.005 for 4:1000 and p < 0.005 for 20:1000). At a breakaway rate
of  50:1000  there  is  no  longer  any  significant  difference  between  the  dog  types, 
probably  indicating  that  at  this  level  of  unpredictability  chance  is  the  overriding 
factor. 


[Figure: Average Efficiency. Series: Plan Based Dogs; Behavior Based Dogs.]

Fig. 2. Efficiency under varying unpredictability.


It  seems  odd  that  the  difference  in  efficiency  is  not  mirrored  by  a  difference 
in  performance.  During  observation  we  noticed  that  on  a  number  of  occasions 
the  plan-based  dogs  would  get  caught  in  a  circling  behaviour  as  they  tried  to 
drive  the  sheep  towards  the  gate.  This  was  essentially  a  bug  in  the  plan  set 
which  was  not  discovered  until  part  way  through  the  experimentation  and  we 
were unable to rerun the experiments with this bug fixed. This needless circling
drives up the number of total world cycles, which would cause performance
to deteriorate while causing the efficiency measure (which is a ratio of
cycles where the dogs are driving the sheep towards a gate to the total number
of world cycles) to increase.

Consequently  we  can  conclude  that  in  this  scenario  the  behaviour  based 
agents are not more robust than the plan-based agents under decreasing
predictability of the environment. However, it is not possible to ascertain whether
or  not  the  plan-based  agents  are  more  robust. 

4.2  Adaptability  to  relative  speed  fluctuations 

Figure  3  shows  the  performance  results  as  the  speed  of  the  environment  changes 
with  respect  to  the  perception  speed  of  the  dogs.  The  most  striking  difference 
is  between  the  plan-based  and  behaviour  based  dogs  as  the  speed  of  the  world 
increases  with  respect  to  the  perception  speed  of  the  agents  (see  LHS  of  figure 
3). The behaviour based agents appear to deteriorate exponentially in both the
predictable (0 breakaways) and somewhat unpredictable (20:1000 breakaways)
environments. Interestingly, the plan-based agents in the predictable environment
actually improve significantly when the environment speeds up by a factor of four
(0.05 > p > 0.005), remaining relatively stable at speed up rates of two and three. With
the  somewhat  unpredictable  environment  the  plan-based  agents  maintain  their 
performance  at  a  doubling  of  world  speed,  but  deteriorate  as  the  world  speeds 
up  by  factors  of  three  and  four. 

As the world slows down relative to the speed of the dogs (see RHS of figure 3),
we  see  a  much  more  stable  pattern.  The  plan-based  agents  at  both  predictability 
levels  maintain  a  relatively  constant  performance  as  the  world  becomes  up  to  four 
times  slower.  There  is  a  significant  improvement  (p  <  0.005)  in  the  performance 
of  the  behaviour  based  dogs  as  the  moderately  unpredictable  world  (20:1000 
breakaway  rate)  becomes  slower. 

Ideally one would like to analyse the data using regression analysis to
determine whether the change function for the plan based dogs is significantly
different  to  that  of  the  behaviour  based  dogs  as  the  environment  speeds  up  and 
slows  down  at  the  different  predictability  levels.  However,  this  would  require 
a  minimum  of  five  measurements  on  each  side  of  the  center  line  and  so  is  not 
possible  to  do  with  the  current  data. 

Looking at the efficiency data in figure 4 we see that in the moderately
unpredictable environment (20:1000 breakaways) both types of agents deteriorate
as the world gets both slower and faster. However, it appears that the plan-based
agents deteriorate less rapidly than the behaviour based agents as the world
becomes faster, while the opposite effect is apparent as the world becomes slower.
Once again the lack of sufficient data points makes it impossible to do
regression analysis, which would be the most appropriate form of statistical analysis.
However, by shifting the data by a fixed amount it is possible to obtain a
common start point at normal world speed and then use a paired two sample t-test.
This  indicates  that  the  behaviour  based  dogs  are  significantly  better  (p  =  0.05) 
which  would  suggest  that  the  deterioration  functions  of  the  two  dog  types  are 
significantly  different.  Comparing  with  the  LHS  of  the  graph  we  see  that  if  the 
trend  of  the  behaviour  based  agents  continued  with  further  data  points  beyond 

[Figure: Average World Cycles Required To Complete Task (logarithmic scale)
against Perception Rate as a Multiplier, where the multiplier indicates how many
perception/action cycles are completed for each world cycle. Series: Plan Based
Dogs @ 0 Breakaways; Plan Based Dogs @ 20 Breakaways/1000 World Cycles;
Behavior Based Dogs @ 0 Breakaways; Behavior Based Dogs @ 20 Breakaways/1000
World Cycles.]

Fig. 3. Performance under varying rates of perception.


the  two  obtained,  the  difference  would  be  greater  than  that  on  the  right  hand 
side  as  the  lines  diverge  more  quickly. 
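
The shift-then-compare procedure mentioned above can be sketched as follows. This is only one plausible reading: the paper does not give the exact shifting rule or the data, so the function names, the choice of shifting each series so that its first value matches a common target, and the toy numbers in the usage note are all assumptions.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence


def shift_to_common_start(series: Sequence[float], target: float,
                          start_index: int = 0) -> List[float]:
    """Add a fixed offset so that series[start_index] equals `target`."""
    offset = target - series[start_index]
    return [value + offset for value in series]


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of a paired two-sample t-test on equal-length series."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample standard deviation).
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

With purely illustrative values, `shift_to_common_start([80, 70, 60], target=100)` gives `[100, 90, 80]`. The resulting t statistic would then be compared against a t distribution with n - 1 degrees of freedom to obtain a p value.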

In  a  predictable  world  (0  breakaways)  the  plan-based  agents  appear  to  be 
stable  until  the  world  is  speeded  up  by  a  factor  of  four,  where  they  start  to 
deteriorate.  They  are  also  stable  as  the  world  slows  down.  The  behaviour  based 
agents  deteriorate  as  the  world  slows  down  by  factors  of  two,  three  and  four.  The 
behaviour  based  agents  in  the  speeded  up  world  only  have  data  for  the  speed  up 
by  two,  and  at  this  point  they  are  stable. 

In  summary,  there  is  a  strong  suggestion  that  as  the  world  speeds  up  plan 
based  agents  are  more  stable  and  deteriorate  less  rapidly  than  behaviour  based 
agents. As the world slows down the situation is less clear with respect to
differences between the agent types and there appears to be far less effect than when
the  world  speeds  up.  One  interesting  effect  in  a  slowed  down  world  is  that  if 
it  is  moderately  unpredictable  the  behaviour  based  agents  improve  significantly 
when  the  world  slows  down  by  a  factor  of  four. 

5  Discussion  and  Conclusion 

The  results  of  this  work  do  not  lend  support  to  the  idea  that  behaviour  based 
agents  are  more  robust  than  plan-based  agents  if  robust  is  defined  in  terms  of 

[Figure: Average Efficiency against Perception Rate as a Multiplier, where the
multiplier indicates how many perception/action cycles are completed for each
world cycle. Series: Plan Based Dogs @ 0 Breakaways; Plan Based Dogs @ 20
Breakaways/1000 World Cycles; Behavior Based Dogs @ 0 Breakaways; Behavior
Based Dogs @ 20 Breakaways/1000 World Cycles.]

Fig. 4. Efficiency under varying rates of perception.


changes  to  the  relative  speed  of  the  environment  and  to  the  unpredictability  of 
the  environment.  In  fact,  the  results  suggest  that  plan-based  agents  may  be  more 
robust  in  this  sense.  The  most  surprising  result  is  that  as  the  world  speeds  up 
the  behaviour-based  agents  appear  to  deteriorate  rapidly  while  the  plan-based 
agents  remain  stable  or  even  improve  slightly.  Possibly  this  is  related  to  the 
results  obtained  by  Kinny  [KG91]  which  showed  that  agents  which  were  able 
to  commit  to  a  plan  of  action  actually  did  better  as  the  world  became  more 
dynamic,  than  agents  that  had  no  such  ability. 

It  may  also  be  the  case  that  the  co-ordination  between  the  two  agents  that 
was required for successful herding of the sheep was more susceptible to
environmental perturbations in the behaviour based program where it “emerged” due to
combining  of  simple  behaviours,  than  when  it  was  more  explicitly  represented. 

Questions  about  the  relative  robustness  of  different  software  paradigms  as 
environments  change  are  important  to  explore  given  that  the  digital  world  is 
changing extremely rapidly and that systems are increasingly likely to be
interacting with other digital systems in a distributed manner.

Further work is needed to collect a larger number of data points, in
particular to allow regression analysis to determine whether the deterioration function
differs significantly with different architectures. Work is also needed to explore
these questions in an environment that is more closely related to application
environments rather than a visual simulation.

Both  systems  exhibited  similar  degradation  as  unpredictability  increased. 
However, it would be interesting to further explore this issue, perhaps
identifying different sorts of unpredictability that may arise and whether the two
approaches  have  different  stability  patterns  with  respect  to  differing  types  of 
unpredictability. 

Abstract. We present a new heuristic method to evaluate planning
states, which is based on solving a relaxation of the planning problem. The
solutions to the relaxed problem give a good estimate for the length of
a real solution, and they can also be used to guide action selection
during planning. Using this information, we employ a search strategy
that combines Hill-climbing with systematic search. The algorithm is
complete on what we call deadlock-free domains. Though it does not
guarantee the solution plans to be optimal, it does find close to optimal
plans in most cases. Often, it solves the problems almost without any
search at all. In particular, it outperforms all state-of-the-art planners
on a large range of domains.


1  Introduction 

The standard approach to obtain a heuristic is to relax the problem $\mathcal{P}$ at hand
into some easier problem $\mathcal{P}'$. The optimal solution length to a situation in $\mathcal{P}'$
can then be used as an admissible estimate for the optimal solution length of the
same situation in $\mathcal{P}$. An application of this idea to domain independent planning
was first used in the HSP system [3]. The planning problem $\mathcal{P}$ is relaxed by simply
ignoring the delete lists of all operators. However, computing the optimal solution
length  for  a  planning  problem  without  delete  lists  is  still  NP-hard,  as  was  first 
shown  by  Bylander  [4].  Therefore,  the  HSP  heuristic  is  only  a  rough  estimate  of 
the  optimal  relaxed  solution  length.  In  short,  it  is  obtained  by  summing  up  the 
minimal  distances  of  all  atomic  goals. 

In  this  paper,  we  go  one  step  further.  We  introduce  a  method  that  computes 
some,  not  necessarily  optimal,  solution  to  the  relaxed  problem.  These  solutions 
are  helpful  in  two  ways: 

-  their length provides an informative estimate for the difficulty of a situation;

-  one can use them as guidance for action selection.

The solution length estimates are used to control a local search strategy
similar to Hill-climbing, which is combined with systematic breadth first search
in order to escape local minima or plateaus. The guidance information is
employed to cut down the branching factor during systematic search. The method
shows good behavior over all domains that are commonly used in the planning
community. In particular, we will see that it is complete on the class of problems
we call deadlock-free. Performing local search, the method cannot guarantee its
solution plans to be optimal. In spite of this, it finds close to optimal plans in
most  cases.  As  a  benefit  from  the  severe  restriction  of  its  search  space,  it  shows 
very  competitive  runtime  behavior.  For  example,  logistics  problems  are  solved 
faster  than  by  any  other  domain  independent  planning  system  known  to  the 
author  at  the  time  of  writing. 


2  Background 

Throughout  the  paper,  we  consider  simple  STRIPS  domains.  We  briefly  review 
two  standard  notations.  An  action  o  has  the  form 

$$ o = \langle\, \mathit{pre}(o) \Rightarrow \mathit{add}(o),\ \mathit{del}(o) \,\rangle $$

where pre(o), add(o) and del(o) are sets of ground facts. Plans P are sequences
P = <o_1, ..., o_n> of actions, i.e., we consider only linear plans.
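
The two notations can be illustrated with a small Python sketch. The class and function names, the fact strings, and the convention that delete effects are removed before add effects are inserted are assumptions made for illustration only; the paper does not prescribe a concrete encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS action: preconditions, add list, delete list."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset


def apply(state, action):
    """Return the successor state, or None if the action is not applicable."""
    if not action.pre <= state:
        return None
    # Assumed convention: remove the delete list first, then add the add list.
    return (state - action.delete) | action.add


def execute(state, plan):
    """Apply a linear plan (a sequence of actions) step by step."""
    for action in plan:
        state = apply(state, action)
        if state is None:
            return None
    return state


# Hypothetical one-ball transport example.
pick = Action("pick ball1 A left",
              pre=frozenset({"at ball1 A", "robot-at A", "free left"}),
              add=frozenset({"carry ball1 left"}),
              delete=frozenset({"at ball1 A", "free left"}))
move = Action("move A B",
              pre=frozenset({"robot-at A"}),
              add=frozenset({"robot-at B"}),
              delete=frozenset({"robot-at A"}))

initial = frozenset({"at ball1 A", "robot-at A", "free left"})
final = execute(initial, [pick, move])
```

Executing the plan in the opposite order fails in this sketch, because the pick action is no longer applicable once the robot has left room A.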

3  Heuristic 

In  this  section,  we  introduce  a  method  for  heuristically  evaluating  planning  states 
S.  Basically,  the  method  consists  of  two  parts. 

1.  First,  the  relaxed  fixpoint  is  built  on  S.  This  is  a  forward  chaining  process 
that  determines  in  how  many  steps,  at  best,  a  fact  can  be  reached  from  S, 
and  with  which  actions. 

2.  Then,  a  relaxed  solution  is  extracted  from  the  fixpoint.  This  is  a  sequence  of 
parallel  action  sets  that  achieves  the  goal  from  S,  if  their  delete  effects  are 
ignored. 

The first part corresponds directly to the heuristic method that is used in HSP [3].
The second part goes one step further: while in HSP, the heuristic is extracted
as a side effect of the fixpoint, we invest some extra effort to find a relaxed
plan, and use the plan to determine our heuristic value. The fixpoint process is
depicted in Figure 1.

The algorithm can be seen as building a layered graph structure, where fact
and action layers are interleaved in an alternating fashion. The process starts
with the initial fact layer, which are the facts that are TRUE in S. Then, the first
action layer comprises the actions whose preconditions are contained in S. The
effects of these actions lead us to the second fact layer, which, in turn, determines
the next action layer and so on. The process terminates, and remembers the
number max of the last layer, if all goals are reached or if the new fact layer is
identical to the last one.


    F_0 := S
    k := 0
    while G ⊈ F_k do
        O_k := {o ∈ O | pre(o) ⊆ F_k}
        F_{k+1} := F_k ∪ ⋃_{o ∈ O_k} add(o)
        if F_{k+1} = F_k then
            break
        endif
        k := k + 1
    endwhile
    max := k


Fig. 1. Computing the relaxed fixpoint on a planning state S. O and G denote the
action set and goal state of the problem at hand, respectively.
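
The forward chaining process can be sketched in Python as follows. This is a minimal illustration, not the FF implementation: actions are assumed to be (name, pre, add) triples of frozensets (delete lists are left out because the relaxation ignores them), and the one-ball transport example with its fact names is hypothetical.

```python
def relaxed_fixpoint(state, goals, actions):
    """Build the fact layers F_0, F_1, ... of the relaxed problem.

    `actions` is a list of (name, pre, add) triples; delete lists are ignored.
    Returns (fact_layers, reached); reached is False if the fixpoint was hit
    before all goals appeared in a fact layer.
    """
    fact_layers = [frozenset(state)]
    while not goals <= fact_layers[-1]:
        current = fact_layers[-1]
        # O_k: all actions whose preconditions are contained in F_k
        applicable = [a for a in actions if a[1] <= current]
        # F_{k+1}: F_k together with the add effects of all actions in O_k
        next_layer = current.union(*(add for _, _, add in applicable))
        if next_layer == current:
            return fact_layers, False
        fact_layers.append(next_layer)
    return fact_layers, True


def fact_levels(fact_layers):
    """level(f): index of the first fact layer that contains f."""
    levels = {}
    for i, layer in enumerate(fact_layers):
        for fact in layer:
            levels.setdefault(fact, i)
    return levels


# Hypothetical example: carry one ball from room A to room B.
ACTIONS = [
    ("pick", frozenset({"robot-at A", "ball-at A", "gripper-free"}),
     frozenset({"holding"})),
    ("move", frozenset({"robot-at A"}), frozenset({"robot-at B"})),
    ("drop", frozenset({"holding", "robot-at B"}), frozenset({"ball-at B"})),
]
STATE = frozenset({"robot-at A", "ball-at A", "gripper-free"})
GOALS = frozenset({"ball-at B"})

layers, reached = relaxed_fixpoint(STATE, GOALS, ACTIONS)
levels = fact_levels(layers)
max_layer = len(layers) - 1
```

In this example the goal fact first appears in the third fact layer, so the process stops with max = 2.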

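The crucial information that the fixpoint process gives us is the levels of
all facts and actions. These are defined as the number of the first fact or action
layer they are members of.

$$ \mathit{level}(f) := \begin{cases} \min\{\, i \mid f \in F_i \,\} & \text{if there exists an } i \text{ with } f \in F_i \\ \infty & \text{otherwise} \end{cases} $$

$$ \mathit{level}(o) := \begin{cases} \min\{\, i \mid o \in O_i \,\} & \text{if there exists an } i \text{ with } o \in O_i \\ \infty & \text{otherwise} \end{cases} $$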

We now show how to extract a relaxed plan from the fixpoint structure. This
is done in a backward chaining manner, where we simply use any action with
minimal level to make a goal TRUE. The exact algorithm is depicted in Figure 2.
Note that we do not need to search: we can proceed right away to the initial
state and are guaranteed to find a solution.

Before plan extraction starts, an array of goal sets G_i is initialized by inserting
all goals with corresponding level. The mechanism then proceeds down from layer
max to layer 1, and selects an action o for each goal g at the current layer i,
incrementing the plan length counter h. No actions are selected for goals that
are already marked TRUE at that point, as they have already been added. The
achiever o is required to have level(o) = i - 1. This is minimal as the goal g has
level i, i.e., the first action that achieved g in the fixpoint came in at level i - 1.
The preconditions of o are inserted as new goals into their corresponding goal sets.
If the current layer is i, then the levels of o's preconditions are at most i - 1, so
these new goals will be made TRUE later during the process.

    for i := 1, ..., max do
        G_i := {g ∈ G | level(g) = i}
    endfor
    h := 0
    for i := max, ..., 1 do
        for all g ∈ G_i, g not TRUE at i do
            select o with g ∈ add(o) such that level(o) = i - 1
            h := h + 1
            for all f ∈ pre(o), f not TRUE at i - 1 do
                G_level(f) := G_level(f) ∪ {f}
            endfor
            for all f ∈ add(o) do
                mark f as TRUE at i - 1 and i
            endfor
        endfor
    endfor


Fig. 2. The algorithm that extracts a relaxed solution to a state S after the fixpoint
has been built.


3.1  Goal  Distance 

To obtain the heuristic goal distance value h(S) of a given planning state S,
we now simply chain the two algorithms together. First, we perform the fixpoint
computation from Figure 1. If the process terminates without reaching the goals,
we set h(S) := ∞. Otherwise, we extract a relaxed plan (Figure 2) and use the
plan length for evaluation, i.e., h(S) := h.
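
Chaining the two phases can be sketched in Python as below. This is an illustrative reading of Figures 1 and 2 under stated assumptions, not the FF code: actions are (name, pre, add) triples of frozensets, the function name and the example domain are hypothetical, and ties between candidate achievers are broken by list order.

```python
import math


def relaxed_plan_length(state, goals, actions):
    """Return (h, selected): h is the relaxed plan length (math.inf if the
    goals are unreachable), selected is a list of (level, action name)."""
    # Forward phase (Figure 1): record the level of every fact and action.
    fact_level = {f: 0 for f in state}
    action_level = {}
    layer, k = set(state), 0
    while not goals <= layer:
        new_layer = set(layer)
        for name, pre, add in actions:
            if pre <= layer:
                action_level.setdefault(name, k)
                new_layer |= add
        if new_layer == layer:
            return math.inf, []
        k += 1
        for f in new_layer - layer:
            fact_level[f] = k
        layer = new_layer
    max_layer = k

    # Backward phase (Figure 2): pick one minimal-level achiever per open goal.
    goal_sets = [set() for _ in range(max_layer + 1)]
    for g in goals:
        goal_sets[fact_level[g]].add(g)
    true_at = [set() for _ in range(max_layer + 1)]
    selected = []
    for i in range(max_layer, 0, -1):
        for g in sorted(goal_sets[i]):
            if g in true_at[i]:
                continue
            name, pre, add = next(a for a in actions
                                  if g in a[2] and action_level.get(a[0]) == i - 1)
            selected.append((i - 1, name))
            for f in pre:
                if f not in true_at[i - 1]:
                    goal_sets[fact_level[f]].add(f)
            for f in add:
                true_at[i - 1].add(f)
                true_at[i].add(f)
    return len(selected), selected


# Hypothetical example: carry one ball from room A to room B.
ACTIONS = [
    ("pick", frozenset({"robot-at A", "ball-at A", "gripper-free"}),
     frozenset({"holding"})),
    ("move", frozenset({"robot-at A"}), frozenset({"robot-at B"})),
    ("drop", frozenset({"holding", "robot-at B"}), frozenset({"ball-at B"})),
]
STATE = frozenset({"robot-at A", "ball-at A", "gripper-free"})

h, plan = relaxed_plan_length(STATE, frozenset({"ball-at B"}), ACTIONS)
h_unreachable, _ = relaxed_plan_length(STATE, frozenset({"ball-at C"}), ACTIONS)
```

On this small example the relaxed plan consists of the pick and move actions at level 0 and the drop action at level 1, so the estimate is 3.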

The overall structure of the relaxed planning process is quite similar to
planning with planning graphs [1]. It amounts to a very special case, as no negative
interactions at all occur between facts or actions in the relaxed problem.

3.2  Helpful  Actions 

We can also use the extracted plan to determine a set of actions that seem to
be helpful in reaching the goal. To do this, we look at the actions
that are contained in the first time step of the relaxed solution, i.e., the actions
that are selected at level 0. These are often the actions that are useful in the
given situation. Consider a simple example, taken from the gripper
domain as it was used in the 1998 AIPS planning systems competition. We do
not repeat the exact definition of the domain here, as it is easily understood
intuitively. There are two rooms, A and B, and a certain number of balls, which
are to be moved from room A to room B. The planner changes rooms via the
move operator, and controls two grippers which can pick or drop balls. Each
gripper can only hold one ball at a time. We look at a small problem where 2
balls must be moved into room B. A relaxed solution to the initial state that
our heuristic might extract is

    < { pick ball1 A left,
        pick ball2 A left,
        move A B },

      { drop ball1 B left,
        drop ball2 B left } >

This is a parallel relaxed plan consisting of two time steps. Note that the
move A B action is selected parallel to the pick actions, as the relaxed planner
does not notice that it cannot pick balls in room A anymore once it has moved
into room B. In a similar fashion, both balls are picked with the left gripper.
Nevertheless, two of the three actions in the first step are helpful in the given
situation: both pick actions are starting actions of an optimal sequential
solution. Thus, one might be tempted to define the set H(S) of helpful actions as
only those that are contained in the first time step of the relaxed plan. However,
this is too restrictive in some cases. We therefore define our set H(S) as follows.

$$ H(S) := \{\, o \in O_0 \mid \mathit{add}(o) \cap G_1 \neq \emptyset \,\} $$

After plan extraction, O_0 contains the actions that are applicable in S, and G_1
contains the facts that were goals or subgoals at level 1. Thus, we consider as
helpful those actions which add at least one fact that was a (sub)goal at the
lowest time step of our relaxed solution.
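
The definition of H(S) can be illustrated with a short Python sketch on the two-ball gripper example. The fact and action names are hypothetical, and the set G_1 is written down by hand as the preconditions of the two drop actions in the relaxed plan shown above; it is an assumption of this example, not output of the planner.

```python
def helpful_actions(state, actions, subgoals_level_1):
    """H(S): applicable actions (O_0) that add at least one fact of G_1.

    `actions` is a list of (name, pre, add) triples of frozensets.
    """
    return [name for name, pre, add in actions
            if pre <= state and add & subgoals_level_1]


# Hypothetical encoding of the two-ball gripper problem (delete lists omitted).
ACTIONS = [
    (f"pick {ball} A {gripper}",
     frozenset({"robot-at A", f"at {ball} A", f"free {gripper}"}),
     frozenset({f"carry {ball} {gripper}"}))
    for ball in ("ball1", "ball2")
    for gripper in ("left", "right")
] + [
    ("move A B", frozenset({"robot-at A"}), frozenset({"robot-at B"})),
    ("move B A", frozenset({"robot-at B"}), frozenset({"robot-at A"})),
]

STATE = frozenset({"robot-at A", "at ball1 A", "at ball2 A",
                   "free left", "free right"})

# Assumed subgoals at level 1: preconditions of the two drop actions.
G1 = frozenset({"carry ball1 left", "carry ball2 left", "robot-at B"})

helpful = helpful_actions(STATE, ACTIONS, G1)
```

With these assumptions, the pick actions that use the right gripper are applicable but not helpful, since the facts they add are not subgoals of the extracted relaxed plan.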

4  Search 

We now introduce a search algorithm that makes effective use of the heuristics
we defined in the last section. The key observation that leads us to the method
is the following. On some domains, like the gripper problems from the 1998
competition and Russell's tyreworld, it is sufficient to use our heuristic in a naive
Hill-climbing strategy. In these problems, one can simply start in the initial state,
pick, in each state, a best valued successor, and end up with an optimal solution
plan. This strategy is very efficient on the problems where it finds plans.

However, the naive method does not find plans on most problems. Usually,
it runs into an infinite loop. To overcome this problem, one could employ
standard Hill-climbing variations, like restarts, limited plateau moves, or a memory
for repeated states. We use an enforced Hill-climbing method instead, see the
definition in Figure 3.

The algorithm combines Hill-climbing with systematic breadth first search.
Like standard Hill-climbing, it picks some successor of the current state at each
stage of the search. Unlike in standard Hill-climbing, this successor does not
need to be a direct one, and we do not pick any best valued successor, but
require the successor to be one that is strictly better than our current state.

More precisely, at each stage during search a successor state is found by
performing breadth first search starting out from the current state S. For each
search state S', all successors are generated and evaluated heuristically.
Repeated states are pruned from the search by keeping a hash table of past states
in memory, and the search stops as soon as it has found a state S' that has a
lower heuristic value than S. This way, the Hill-climbing search escapes plateaus
and local minima by simply performing exhaustive search for an exit, i.e., a state
with strictly better heuristic evaluation.


    initialize the current plan to the empty plan <>
    S := I
    obtain h(S) by evaluating S
    if h(S) = ∞ then
        output "No Solution", stop
    endif
    while h(S) ≠ 0 do
        breadth first search for a state S' with h(S') < h(S)
        if no such state can be found then
            output "No Solution", stop
        endif
        add the actions on the path to S' at the end of the current plan
        S := S'
    endwhile


Fig. 3. The Enforced Hill-climbing algorithm. I denotes the initial state of the problem
to be solved.
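
The search loop of Figure 3 can be sketched in Python as follows. The sketch is generic and makes several assumptions: the caller supplies a `successors` function returning (action, state) pairs and a heuristic function `h`, states are hashable, and the toy line-world with its hand-written heuristic table (which contains plateaus) is purely hypothetical.

```python
import math
from collections import deque


def enforced_hill_climbing(initial, successors, h):
    """Return a list of actions, or None if no strictly better state is found.

    successors(state) yields (action, next_state) pairs; h(state) is the
    heuristic value, with 0 meaning that the goal is reached.
    """
    plan, state = [], initial
    h_state = h(state)
    if h_state == math.inf:
        return None
    while h_state != 0:
        # Breadth first search for any state with a strictly lower h value.
        queue = deque([(state, [])])
        seen = {state}          # table of states already visited in this search
        found = None
        while queue and found is None:
            current, path = queue.popleft()
            for action, nxt in successors(current):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if h(nxt) < h_state:
                    found = (nxt, path + [action])
                    break
                queue.append((nxt, path + [action]))
        if found is None:
            return None
        state, segment = found
        plan.extend(segment)    # append the path to the better state
        h_state = h(state)
    return plan


# Hypothetical toy problem: walk along positions 0..6, the goal is position 6.
H_TABLE = {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 1, 6: 0}


def line_successors(x):
    if x < 6:
        yield "inc", x + 1
    if x > 0:
        yield "dec", x - 1


plan = enforced_hill_climbing(0, line_successors, H_TABLE.__getitem__)
```

Starting at position 0, the first breadth first search has to cross the plateau of value 3 before it finds position 3 with value 2; naive Hill-climbing would have no strictly better direct successor to pick there.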


4.1  Helpful  Actions 

So far, we have only used the goal distance heuristic. We integrate the helpful
actions heuristic into our search algorithm as follows. During breadth first search,
we do not generate all successors of any search state S' anymore, but consider
only those that are obtained by applying actions from H(S'). This way, the
branching factor for the search is cut down. However, considering only the actions
in H(S') might make the search miss a goal state. If this happens, i.e., if the
search cannot reach any new states anymore when restricting the successors to
H(S'), we simply switch back to complete breadth first search starting out from
the current state S and generating all successors of search nodes.

5  Completeness 

The Enforced Hill-climbing algorithm is complete on deadlock-free planning
problems. We define a deadlock to be a state S that is reachable from the initial
state I, and from which the goal cannot be reached anymore. A planning problem is
called deadlock-free if it does not contain any deadlock state. We remark that
a deadlock-free problem is also solvable, because otherwise the initial state itself
would already be a deadlock.

Theorem 1. Let P be a planning problem. If P is deadlock-free, then the
Enforced Hill-climbing algorithm, as defined in Figure 3, will find a solution.

Due to space restrictions, we do not show the (easy) proof of Theorem 1 here
and refer the reader to [5]. In short, if the complete breadth first search starting
from a state S cannot reach a better evaluated state, then, in particular, it
cannot reach a goal state, which implies that the state S is a deadlock, in
contradiction to the assumption.

In [5], it is also shown that most of the currently used benchmark domains are
in fact deadlock-free. Any solvable planning problem that is invertible, in the sense
that one can find, for each action sequence P, an action sequence that undoes
P's effects, does not contain deadlocks: one can always go back to the initial state
first and execute an arbitrary solution thereafter. Moreover, planning problems
that contain an inverse action for each action o are invertible: simply undo
all actions in the sequence P by executing the corresponding inverse actions.
Finally, most of the current benchmark domains do contain inverse actions. For
example, in the blocksworld we have stack and unstack. Similarly, in domains
that deal with logistics problems, for example logistics, ferry, gripper etc., one
can often find inverse pairs of actions. If an action is not invertible, its role in
the domain is often quite limited. A nice example is the inflate operator in
the tyreworld, which can be used to inflate a spare wheel. Obviously, there is
not much point in defining something like a deflate operator. More formally
speaking, the operator does not destroy a goal or a precondition of any other
operator in the domain. In particular, it does not lead into deadlocks.

6  Empirical  Results 

For empirical evaluation, we implemented the Enforced Hill-climbing algorithm,
using relaxed plans to evaluate states and to determine helpful actions, in C. We
call the resulting planning system FF, which is short for FAST-FORWARD
planning system. All running times for FF are measured on a Sparc Ultra 10 running
at 350 MHz, with a main memory of 256 MBytes. Where possible, i.e., for those
planners that are publicly available, the running times of other planners were
measured on the same machine. We indicate run times taken from the literature
in the text. All planners were run with the default parameters, unless otherwise
stated in the text, and all benchmark problems are the standard examples taken
from the literature. Some benchmark problems have been modified in order to
show how planners scale to bigger instances. We explain the modifications made,
if any, in the text. Dashes indicate that the corresponding planner failed to solve
that problem within half an hour.

6.1  The  Logistics  Domain 

This is a classical domain, involving the transportation of packets via trucks
and airplanes. There are two well known test suites. One has been used in the
1998 AIPS planning systems competition, the other one is part of the BLACKBOX
distribution. The problems in the competition suite are very hard. In fact, they
are so hard that, to date, no planner has been reported to solve them all.
Fast-Forward is the first one that does. See Figure 4, which also shows the results
for GRT [12] and HSP-r [2], which are, as far as the author knows, the two best
other domain independent logistics planners at the time of writing (see footnote 1).

Fig. 4. Results of the three domain independent planners best suited for logistics
problems on the 1998 competition suite. Times are in seconds, steps counts the number
of actions in a sequential plan. For HSP-r, the weighting factor W is set to 5, as was
done in the experiments described by Bonet and Geffner in [2].


The  times  for  GRT  in  Figure  4  are  from  the  paper  by  Refanidis  and  Vlahavas 
[12],  where  they  are  measured  on  a  Pentium  300  with  64  M  Byte  main  memory. 
FF  outperforms  both  HSP-r  and  GRT  by  an  order  of  magnitude.  Also,  it  finds 
shorter  plans  than  the  other  planners. 

We  also  ran  FF  on  the  benchmark  problems  from  the  BLACKBOX  distribution 
suite,  and  it  solved  all  of  them  in  less  than  half  a  second.  Compared  to  the  results 
shown  by  Bonet  and  Geffner  [2]  for  these  problems,  FF  was  between  2  and  10 
times  faster  than  HSP-r,  finding  shorter  plans  in  all  cases. 


1  It is important to distinguish the results shown in Figure 4 from those reported
earlier for HSP-r [2]. Those results were taken on the problems from the BLACKBOX
distribution, while our results are taken on the 1998 competition test suite.

6.2  Mixed Classical Problems

Fast-Forward  shows  competitive  behavior  on  all  commonly  used  benchmark 
domains.  To  exemplify  this,  we  show  a  table  of  running  times  on  a  variety  of 
different domains in Figure 5, comparing FF against a collection of
state-of-the-art planning systems: IPP [8], STAN [9], BLACKBOX [7], and HSP [3].

Fig. 5. Running times and quality (in terms of number of actions) of plans for FF and
state-of-the-art planners on various classical domains. All planners are run with the
default parameters, except HSP, where loop checking needs to be turned on.


In  Figure  5,  the  planning  problems  shown  are  the  following.  The  tyreworld 
problem  was  originally  formulated  by  Russell,  and  asks  the  planner  to  replace 
a  flat  tire.  The  problem  is  modified  in  a  natural  way  so  as  to  make  the  planner 
replace  n  flat  tires.  FF  is  the  only  planner  that  is  capable  of  replacing  more  than 
three  tires,  scaling  up  to  much  bigger  problems. 

The  hanoi  problems  make  the  planner  solve  the  well  known  Towers  of  Hanoi 
problem,  with  n  discs  to  be  moved.  FF  also  outperforms  the  other  planners  on 
these  problems. 

The  sokoban  problem  encodes  a  small  instance  of  a  well  known  computer 
game,  where  a  single  stone  must  be  pushed  to  its  goal  position.  Although  the 
problem  contains  deadlocks,  FF  has  no  difficulties  in  solving  it. 

The manhattan domain was first introduced by McDermott [10]. In these
problems, the planner controls a robot which moves on an n × n grid world, and
has to deal with different kinds of keys and locks. The original problem taken
from [10] corresponds to the mh-11 entry in Figure 5, where the robot moves
on an 11 × 11 grid. The other entries refer to problems that have been modified
to encode 7 × 7, 15 × 15 and 19 × 19 grid worlds, respectively. FF easily handles
all of them, finding slightly suboptimal plans.

Finally, the blocksworld problems in Figure 5 are benchmark examples taken
from [6]. FF outperforms the other planners in terms of running time as well as
in terms of solution length.

7  Related Work

The closest relative to the work described in this paper is, quite obviously, the
HSP system [3]. In short, HSP does Hill-climbing search, with the heuristic
function

$$ h(S) := \sum_{g \in G} \mathit{weight}_S(g) $$

The weight of a fact with respect to a state S is, roughly speaking, the
minimum over the sums of the precondition weights of all actions that achieve it.
The weights are obtained as a side effect of doing exactly the same fixpoint
computation as we do. The main problem in HSP is that the heuristic needs
to be recomputed for each single search state, which is very time consuming.
Inspired by HSP, a few approaches have been developed that try to cope with
this problem, like HSP-r [2] and the GRT-planner [12].

The authors of HSP themselves handle the problem by sticking to their
heuristic, but changing the search direction, going backwards from the goal in HSP-r
instead of forward from the initial state in HSP. This way, they need to compute
a weight value for each fact only once, and simply sum the weights up for a state
later during search.

The  authors  of  [12]  invert  the  direction  of  the  HSP  heuristic  instead.  While 
HSP  computes  distances  by  going  towards  the  goal,  GRT  goes  from  the  goal  to 
each  fact,  and  estimates  its  distance.  The  function  that  then  extracts,  for  each 
state  during  forward  search,  the  state’s  heuristic  estimate,  uses  the  pre-computed 
distances  as  well  as  some  information  on  which  facts  will  probably  be  achieved 
simultaneously. 

For the Fast-Forward planning system, a somewhat paradoxical extension
of HSP has been made. Instead of avoiding the major drawback of the HSP
strategy, we even worsen it, at first sight: the heuristic keeps being fully recomputed
for each search state, and we even put some extra effort on top of it, by
extracting a relaxed solution. However, the overhead for extracting a relaxed solution
is marginal, and the relaxed plans can be used to prune unpromising branches
from the search tree.

To verify where the enormous run time advantages of FF compared to HSP
come from, we ran HSP using Enforced Hill-climbing search with and without
helpful actions pruning, as well as FF without helpful actions on the problems
from our test suite. Due to space restrictions, we cannot show our findings
in detail here. It seems that the major steps forward are our variation of
Hill-climbing search in contrast to the restart techniques employed in HSP, as well
as the helpful actions heuristic, which prunes most of the search space on many
problems. Our different heuristic distance estimates seem to result in shorter
plans and slightly better running times (about a factor of two), when one compares
FF to a version of HSP that uses Enforced Hill-climbing search and helpful actions
pruning. We have not yet found the time to do these experiments the other way
round, i.e., integrate our heuristic into the HSP search algorithm, as this would
involve modifying the original HSP code, which means a lot of implementation
work.


226  J.  Hoffmann 


There has been at least one more approach in the literature where goal
distances are estimated by ignoring the delete lists of the operators. In [10],
Greedy Regression-Match Graphs are introduced. In a nutshell, these estimate
the goal distance of a state by backchaining from the goals until facts are reached
that are TRUE in the current state, and then counting the estimated minimal
number of steps that are needed to achieve the goal state.

To the best of our understanding, the action chains that lead to a state's
heuristic estimate in [10] are similar to the relaxed plans that we extract.
However, the backchaining process seems to be quite costly. For example, building
the Greedy Regression-Match Graph for the initial state of the manhattan world
11 × 11 grid problem is reported to take 25 seconds on a Sparc 2 station. For
comparison, we ran FF on a Sparc 4 station. Finding a relaxed plan for the
initial state takes less than one hundredth of a second, i.e., the time measured
is 0.00 CPU seconds.

The  helpful  actions  heuristic  shares  some  similarities  with  what  is  known  as 
relevance  from  the  literature  [11].  The  main  difference  is  that  relevance  in  the 
usual  sense  refers  to  what  is  useful  for  solving  the  whole  problem.  Being  helpful, 
on  the  other  hand,  refers  to  something  that  is  useful  in  the  next  step. 

8  Conclusion  and  Outlook 

In this paper, we presented two heuristics for domain independent STRIPS
planning, one estimating the distance of a state to the goal, and one collecting a
set of promising actions. Both are based on an extension of the heuristic that is
used in the HSP system. We showed how these heuristics can be used in a
variation of Hill-climbing search, and we have seen that the algorithm is complete
on the class of deadlock-free domains. We collected empirical evidence that the
resulting planning system is among the fastest planners in existence nowadays,
outperforming the other state-of-the-art planners on quite a range of domains,
like the logistics, manhattan and tyreworld problems.

To the author, the most exciting question is this: Why is the heuristic
information obtained in this simple manner so good? It is not really difficult to
construct abstract examples where the approach produces arbitrarily bad plans,
or uses arbitrarily much time, so why does it almost never go wrong on the
benchmark problems? Why is the relaxed solution always so close to a real
solution, except for the Tower of Hanoi problems? Is it possible to define a notion of
"simple" planning domains, where relaxed solutions have desirable properties?

First  steps  into  that  direction  seem  to  indicate  that,  in  fact,  there  might  be 
some  underlying  theory  in  that  sense.  In  particular,  it  can  be  proven  that  the 
Enforced  Hill-climbing  algorithm  finds  optimal  solutions  when  the  heuristic  used 
is  goal-directed,  in  the  following  sense: 

$$ h(S) < h(S') \Rightarrow \min(S) < \min(S') $$

Here,  min(S)  denotes  the  length  of  the  shortest  possible  path  from  state  S 
to  a  goal  state,  i.e.,  Enforced  Hill-climbing  is  optimal  when  heuristically  better 
evaluated  states  are  really  closer  to  the  goal. 
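
On a finite set of states, the goal-directedness condition can be checked by brute force, as in the following Python sketch. The function name and the two small tables are hypothetical; the sketch only restates the implication above and is not part of the proof mentioned in the text.

```python
from itertools import permutations


def is_goal_directed(h, true_distance):
    """Check h(S) < h(S') => min(S) < min(S') for all pairs of states.

    `h` and `true_distance` map each state to its heuristic value and to the
    length of a shortest path to a goal state, respectively.
    """
    return all(true_distance[s] < true_distance[t]
               for s, t in permutations(h, 2)
               if h[s] < h[t])


# Hypothetical values for three states a, b, c.
TRUE_DISTANCE = {"a": 0, "b": 2, "c": 5}
H_GOOD = {"a": 0, "b": 1, "c": 2}   # orders the states like the true distance
H_BAD = {"a": 0, "b": 2, "c": 1}    # prefers c over b although c is farther away

good = is_goal_directed(H_GOOD, TRUE_DISTANCE)
bad = is_goal_directed(H_BAD, TRUE_DISTANCE)
```

Note that the condition only constrains the ordering of states, so a heuristic may underestimate or overestimate the true distances and still be goal-directed.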

It can also be proven that the length of an optimal relaxed solution is, in fact,
a goal-directed heuristic in the above sense on the problems from the gripper
domain that was used in the 1998 planning systems competition. We have not
yet, however, been able to identify some general structural property that implies
goal-directedness of optimal relaxed solutions.

Apart  from  these  theoretical  investigations,  we  want  to  extend  the  algorithms 
to  handle  richer  planning  languages  than  STRIPS,  in  particular  ADL  and  resource 
constrained  problems. 

Acknowledgments.  The  author  thanks  Bernhard  Nebel  for  helpful  discussions 
and  suggestions  on  designing  the  paper. 

Abstract. We propose a planning architecture where the planner and
the executor interact with each other in order to face dynamic changes
of the application domain. According to the deferred planning strategy
proposed in [14], a plan schema is produced off-line by a generative
constraint based planner and refined at execution time by retrieving
up-to-date information when the available information is no longer valid.
In this setting, both planning and execution can be seen as search processes
in the space of partial plans. We exploit the Interactive Constraint
Satisfaction framework [12], which represents an extension of the Constraint
Satisfaction paradigm for dealing with incomplete knowledge. Given the
uncertainty of the plan execution in dynamic environments, a backup and
recovery mechanism is necessary in order to allow backtracking at execution time.


1  Introduction 

In dynamic and changing environments, a plan produced off-line by a traditional
generative planner can fail during execution because the environment
can change, often in unpredictable ways. In particular, our planner works in
a networked computer system environment and assembles configuration plans.
The information about system services and resources cannot be complete at
planning time due to its vastness and dynamic nature. In these cases it is impossible
to produce a complete successful plan at plan generation time, and there is a need
to sense correct and up-to-date information at execution time in order to refine
the plan. In [14], the authors propose a classification followed by a deep analysis
of the main planning strategies able to integrate execution time sensory data
into the planning process. The strategy we follow is called deferred planning,
and consists in delaying until execution the decisions depending on sensing. As a
consequence, there is a need for a sensing mechanism for testing the environment
and a procedure for plan refinement at execution time. We integrate a
constraint-based planner aimed at producing a plan schema with an executor able
to refine the plan before executing it. Both components are able to sense the real
world by means of a constraint based framework, called Interactive Constraint
Satisfaction Problem (ICSP), proposed in [12]. The main point of this paper is to
describe our architecture and how it implements the deferred planning strategy
by exploiting the ICSP framework.

2  Different Strategies to Cope with Dynamicity

The increased complexity of traditional planning techniques when applied to
dynamic environments is due to the facts that (i) typically the planner is not
the only agent that causes changes in the system and (ii) changes are often not
deterministic. This can lead to a failure of the plan execution, either because
action preconditions no longer hold at execution time, or because action
effects are not those expected.

In [14], the authors present three different extensions to conventional planning
techniques whose aim is to cope with uncertainty:

— planning for all contingencies, so that once sensing is performed, only the
plan corresponding to the actual contingency will be executed [15,2];

— making assumptions, so that planning decisions will be based only on the
assumed value of the sensing result [5,9];

— deferring planning decisions until information depending on sensors is
available [14,5,8].

The appropriateness of each strategy depends on the application and, in
particular, on the criticality of mistakes, on the complexity of the domain, and
on the acceptability of suspending execution to do more planning.

Our architecture follows the deferred planning strategy, as described in the
next section. The deferred planning approach aims at avoiding useless
computation at planning time. Portions of the plan that require information
available only at execution time are left incomplete. As a result, the planner
may miss some important dependencies between the partial plans it is producing.
This is why plan execution can fail, and why the actions contained in partially
specified plans are strongly required to be reversible.

3  An  ICSP-Based  Planning  Architecture 

Our planning architecture is in charge of computing configuration plans in a
networked computer system [4,10]. The domain knowledge is composed of many
different types of objects (e.g., machines, users, printers, services, files,
processes), their attributes (e.g., sizes, availability, location) and the
relations among them (e.g., user u is logged on machine m). There is thus an
enormous amount of knowledge to consider. In addition, this information can
change during the system's life because of actions performed on the objects
(e.g., removing or creating files, connecting or disconnecting machines, adding
or deleting users, starting or killing processes). Thus, it is not convenient,
if possible at all, to store all this information in advance and keep it
up-to-date. We developed a planner able to deal with dynamic and incomplete
knowledge. Our solution follows the deferred planning approach described in
[14]. However, while in [14] the deferred decisions are represented by all the
goals involving data that must be obtained through sensing, our planner does
not defer until execution all the goals which require sensing, since it is able
to sense at planning time. In our approach deferred decisions are represented by:

— non-deterministic variable bindings: since variable domain values represent
alternative resources whose state can change during or after plan construction,
we want to avoid committing to premature choices as much as possible;

— acquisition of up-to-date information when that sensed at planning time is
no longer valid.

Both the planner and the executor are able to sense the real system by means
of a constraint-based framework that extends the Constraint Satisfaction
paradigm and is called the Interactive Constraint Satisfaction Problem
framework [12].


3.1  Preliminaries 

Interactive Constraints (ICs) are declarative relations among variables whose
domain (i.e., the set of values the variables can assume) is possibly partially or
completely unknown. An interactive domain is defined as D(X) = [List ∪ Undef],
where List represents the set of known values for variable X, and Undef is a
domain variable itself, representing (intensional) information which is not yet
available for variable X. An Interactive Constraint Satisfaction Problem (ICSP)
is defined on a set of variables ranging on interactive domains. Variables are
linked by ICs that define (possibly partially known) combinations of values that
can appear in a consistent solution. As for traditional Constraint Satisfaction
Problems, a solution to an ICSP is found when all the variables are instantiated
consistently with the constraints. For a formal definition of the ICSP framework
see [12]. The operational behaviour of ICs extends standard constraint
propagation with a data acquisition mechanism devoted to retrieving consistent
values for variable domains. In particular, given a binary interactive
constraint IC(c(X,Y)), its operational behaviour is the following:

1. If both variables are associated with a partially or completely unknown
domain, the constraint is suspended;

2. else, if both variables range on a completely known domain, the constraint
is propagated as in classical CSPs;

3. else, if one variable (say X) ranges on a fully known domain and the other
(Y) is associated with a fully unknown domain, a knowledge acquisition step
is performed; this returns either a finite set of consistent values representing
the domain of Y, or an empty set representing failure;

4. else, if X ranges on a fully known domain and Y is associated with a partially
known one, the domain of Y is pruned of the values not consistent with X. If the
domain of Y becomes empty, a new knowledge acquisition step is performed for
Y, driven by X.
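
The four cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ECLiPSe implementation discussed in this paper: the names Domain, propagate, check and acquire are hypothetical, the Undef part of a domain is reduced to a boolean flag, and the caller is assumed to pass the fully known variable as x in cases 3 and 4.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class Domain:
    """Interactive domain: known values plus a flag standing in for Undef."""
    known: Set[str] = field(default_factory=set)
    has_undef: bool = True  # True while part of the domain is still unknown

    def fully_known(self) -> bool:
        return not self.has_undef

    def fully_unknown(self) -> bool:
        return self.has_undef and not self.known


def propagate(x: Domain, y: Domain,
              check: Callable[[str, str], bool],
              acquire: Callable[[Set[str]], Set[str]]) -> str:
    """Apply IC(c(X, Y)) once and report which case was taken.

    check(a, b) tests c on two known values (hypothetical helper);
    acquire(xs) senses values of Y consistent with some value in xs.
    """
    # Case 1: neither domain is completely known, so the constraint waits.
    if not x.fully_known() and not y.fully_known():
        return "suspended"
    # Case 2: both domains are completely known, classical pruning.
    if x.fully_known() and y.fully_known():
        y.known = {b for b in y.known if any(check(a, b) for a in x.known)}
        x.known = {a for a in x.known if any(check(a, b) for b in y.known)}
        return "propagated" if x.known and y.known else "failed"
    if not x.fully_known():
        raise ValueError("pass the fully known variable as x")
    # Case 3: Y is fully unknown, so values are acquired, driven by X.
    if y.fully_unknown():
        y.known = set(acquire(x.known))
        return "acquired" if y.known else "failed"
    # Case 4: Y is partially known; prune first, acquire again if emptied.
    y.known = {b for b in y.known if any(check(a, b) for a in x.known)}
    if not y.known:
        y.known = set(acquire(x.known))
        return "acquired" if y.known else "failed"
    return "pruned"
```

In this sketch an acquisition leaves the Undef flag of Y set, so that later acquisitions remain possible; whether a domain is closed after sensing is a design choice the sketch does not settle.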

The ICSP framework is general and can be used in many applications. It is
particularly suited for applications that process a large amount of constrained
data provided by a lower level system; see for instance [11,12].

3.2 The Algorithm

According to the deferred planning strategy proposed in [14], a plan schema is
produced off-line by a generative planning process and refined at execution time
by retrieving up-to-date information when the available information is no longer
valid. In this setting, both planning and execution are search processes in the
space of partial plans. More precisely, plan execution can be seen as the second
phase of a single search algorithm aimed at both producing and executing a
plan. The generative phase of the algorithm is a Partial Order Planner (POP)
[16] interleaving open condition¹ achievement and conflict resolution steps.
As far as open condition achievement is concerned, three alternative cases are
possible: (i) the open condition is already satisfied in the initial state,
(ii) it can be satisfied by an action already in the plan, (iii) a new action is
needed in order to satisfy it.

The planning problem is mapped onto an ICSP so that the planner can both
exploit constraint satisfaction techniques to reduce the search space and deal
with incomplete knowledge. The method we propose embeds knowledge acquisition
into the constraint solving mechanism, thus simplifying the planning process in
two respects. First, there is no need to add declarative sensing actions to the
plan [1,6,10]: we provide a sensing mechanism in which no declarative action is
needed apart from the causal actions. Second, only information that is
significant for the planner is retrieved. As a consequence, variable domains are
significantly smaller than in the standard case.

Open conditions are treated as ICs. Variables appearing in ICs represent
system resources, and domain values represent alternative instances. Variable
domains contain all the known alternative resources; they can be either (i)
completely known, containing objects which can be assigned to the corresponding
variable; (ii) partially known, containing some values already available and a
variable representing intensional future acquisitions; (iii) totally unknown,
when no information has yet been retrieved for the variable. As soon as an open
condition p(X, Y) is selected, the constraint solver propagates the
corresponding IC(p(X, Y)) to test whether there exists at least one value of X
and Y that already satisfies p in the initial state. When the variables (X, Y)
range on known domains, traditional constraint propagation is performed in order
to prune inconsistent values from the domains; otherwise constraint propagation
results in the acquisition of domain values. In order to provide Interactive
Constraints with the capability to sense the system, we need to associate them
with appropriate information gathering procedures, working as access modules to
the real world. In our environment such procedures can be simple UNIX sensing
commands, or scripts when sensory requests involve setup activities. It is worth
noting that, when appropriate sensors are available, Interactive Constraints
retrieve only information consistent with the context, which simplifies the task
of pruning inconsistent alternatives. For instance, suppose that, during the
planning process, we need to locate a file mydoc in a UNIX system, i.e., we need
to propagate the interactive constraint inDirectory(mydoc, Location). Suppose,
also, that the file mydoc is initially contained in three different directories
dir1, dir2 and dir3. If variable Location has an unknown domain, an acquisition
step is performed and those three values are retrieved (through the find Unix
sensing command); otherwise the domain is pruned of inconsistent values (e.g.,
dir4).

1 An open condition is indifferently represented by a precondition or a final
goal conjunct still to be satisfied.
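
The inDirectory example can be reproduced on a scratch directory tree. The sketch below is an assumption-laden illustration: it replaces the find command with Python's os.walk, and the function names sense_in_directory, propagate_in_directory and demo are hypothetical. An unknown domain is represented by None, a (partially) known one by a set of directory names.

```python
import os
import tempfile


def sense_in_directory(filename, root):
    """Sense the world: names of directories under root that contain filename."""
    return {os.path.basename(dirpath)
            for dirpath, _subdirs, files in os.walk(root)
            if filename in files}


def propagate_in_directory(filename, location, root):
    """Propagate inDirectory(filename, Location).

    location is None for an unknown domain (acquisition step),
    otherwise a set of candidate directories (pruning step).
    """
    sensed = sense_in_directory(filename, root)
    if location is None:
        return sensed
    return location & sensed


def demo():
    """Build dir1..dir4 with mydoc in the first three, then sense them."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("dir1", "dir2", "dir3", "dir4"):
            os.mkdir(os.path.join(root, name))
        for name in ("dir1", "dir2", "dir3"):
            with open(os.path.join(root, name, "mydoc"), "w") as handle:
                handle.write("")
        acquired = propagate_in_directory("mydoc", None, root)
        pruned = propagate_in_directory("mydoc", {"dir1", "dir4"}, root)
    return acquired, pruned
```

Calling demo() performs one acquisition and one pruning step: the first result contains the three directories holding mydoc, and the second drops dir4 from the candidate set.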

If a constraint fails (i.e., a variable domain becomes empty), the
corresponding precondition is not satisfied in the initial state (i.e., an
action is needed in order to achieve it). On the other hand, when more than
one value is left in a variable domain after all possible propagation, all
those values satisfy the constraint in the initial state. In a traditional
CS-based approach, a non-deterministic labelling step is needed in order
to find a final solution. In our architecture, the labelling step takes place at
plan execution time, so that at the end of the generative phase variables might
be associated with a domain containing more than one value.

Given the plan schema produced by the generative phase, the executor selects
the first action to be executed. An interactive constraint propagation activity
checks the satisfiability of its preconditions in the real world. If the
precondition variables are already instantiated, the interaction with the
underlying system results in a consistency check, while if those variables are
associated with a domain, the domain can be pruned in order to remove values
which are no longer consistent with the current state of the system. Value
removal can trigger constraint propagation which, in turn, removes values from
other variable domains, thus reducing the execution search space. If, after
propagation, a domain is empty, meaning that the values retrieved at planning
time no longer satisfy the corresponding precondition p, a backtracking step is
performed in order to select an alternative action or partial plan which
satisfies p. When all the variables of the action range on non-empty domains,
the necessary non-deterministic labelling steps are performed and the action is
executed. The same reasoning applies until all the actions are successfully
executed.
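
The execution phase just described can be sketched as a loop over the plan. The code below is a simplified illustration, not the executor of this paper: Action, execute and the World type are hypothetical names, each action has a single open variable, constraint propagation between actions is omitted, and backtracking is reduced to trying the alternative actions listed for the same step (no undoing of already executed actions).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

World = Dict[str, Set[str]]  # hypothetical world model: directory -> file names


@dataclass
class Action:
    name: str
    domain: List[str]                      # values left open by the planner
    holds: Callable[[World, str], bool]    # precondition check in the real world
    apply: Callable[[World, str], None]    # effect of the action on the world


def execute(steps: List[List[Action]], world: World) -> Tuple[bool, List[Tuple[str, str]]]:
    """Run the plan; each step lists an action followed by its alternatives."""
    trace = []
    for alternatives in steps:
        for action in alternatives:
            # Prune values that no longer satisfy the precondition.
            live = [v for v in action.domain if action.holds(world, v)]
            if live:
                value = live[0]            # deferred labelling step
                action.apply(world, value)
                trace.append((action.name, value))
                break
        else:
            # Every alternative has an empty domain: the execution fails here.
            return False, trace
    return True, trace
```

With a copy action whose domain is ['/sbin/rl1', '/sbin/rl2'] and whose precondition checks that the file is present, removing the file from the first directory before execution makes the loop label the variable with the second one instead.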

3.3  An  Example 

Let us consider a network where a monitoring application ensures that certain
processes are up and running (i.e., that their status is on). Once the status of
the system is recognized as faulty, the planner is activated in order to provide
a recovery plan.

Let us suppose that one of those processes, called Trigger, is off, and that
for activating it the planner generates the plan P1 of actions shown in Figure
1. Note that some domains are partially known, while others are still completely
unknown. TriggerStart is the daemon process in charge of activating the Trigger
process, and its code is contained in the executable file TMAboot. When
activating TriggerStart, TMAboot must be located in a directory (X)
corresponding to the so-called runlevel (I) of the process. For instance, if
TriggerStart is to be activated at runlevel '3', TMAboot must be in a directory
called '/sbin/rl3'. The runlevel is a parameter of the machine which is set at
boot-time. In particular, in order to achieve the goal of having processes
TriggerStart and Trigger on, P1

***** plan to be executed: *****
killProcess(TriggerStart)
copy(TMAboot, D1, X);  X :: [Undef]  D1 :: [/sbin/rl1, /sbin/rl2, Undef]
onTriggerStart(I, X);  X :: [Undef]  I :: [3, Undef]


Fig. 1. Plan to be executed.


suggests that process TriggerStart is killed, that file TMAboot is copied from
directory D1 to directory X, and that process TriggerStart is activated from
the directory X corresponding to the runlevel I. At planning time only relevant


***** executing plan... *****
now checking preconditions...
-> condition status(TriggerStart, on) succeeded
...preconditions checked.
now doing labelling on preconditions...
labelling killProcess(TriggerStart)
...labelling on preconditions done.
now executing action 1: killProcess(TriggerStart)...
-> action killProcess(TriggerStart) succeeded
now checking preconditions...
-> condition inDirectory(TMAboot, D1) succeeded
...preconditions checked.
now doing labelling on preconditions...
labelling copy(TMAboot, D1, X)
...labelling on preconditions done.
now executing action 2: copy(TMAboot, /sbin/rl1, X)...
-> action copy(TMAboot, /sbin/rl1, /sbin/rl3) succeeded
now checking preconditions...
-> condition status(TriggerStart, off) succeeded
-> condition inDirectory(TMAboot, /sbin/rl3) succeeded
-> condition configDir(/sbin/rl3, I) succeeded
...preconditions checked.
now doing labelling on preconditions...
labelling onTriggerStart(I, /sbin/rl3)
...labelling on preconditions done.
now executing action 3: onTriggerStart(3, /sbin/rl3)...
-> action onTriggerStart(3, /sbin/rl3) succeeded
***** ...plan executed *****


Fig. 2. Output messages generated during the execution of the plan: case 1.


facts are retrieved from the world: in particular, the planner knows that the
machine is on at runlevel 3, that four directories ('/sbin/rl0', '/sbin/rl1',
'/sbin/rl2', '/sbin/rl3') exist and that they correspond to four different
runlevels, that process TriggerStart is on and that process Trigger is off.
Finally, it knows that a copy of the file TMAboot is contained in two different
directories ('/sbin/rl1', '/sbin/rl2'). If the world does not change, the
executor will instantiate variable D1 either to '/sbin/rl1' or to '/sbin/rl2',
variable

now checking preconditions...
-> condition inDirectory(TMAboot, D1) succeeded
...preconditions checked.
now doing labelling on preconditions...
labelling copy(TMAboot, D1, X)
...labelling on preconditions done.
now executing action 2: copy(TMAboot, /sbin/rl2, X)...
-> action copy(TMAboot, /sbin/rl2, /sbin/rl3) succeeded


Fig. 3. Output messages generated during the execution of action copy: case 2.


X to '/sbin/rl3' and I to 3. The output messages generated by the execution
module are those of Figure 2. We can recognize different steps in the execution
of each action: a first phase where the executor checks whether the
preconditions of the current action hold, a labelling phase where domains, if
any, are labelled, and finally an execution phase, which modifies the state of
the world.

If the actions are successfully performed, the world is led to a final state with
all the relevant processes on. Now, let us suppose that before executing action
copy(TMAboot, D1, X) some external agent in the world deletes file TMAboot
from '/sbin/rl1'. The actual world then contains only one instance of the file,
in directory '/sbin/rl2'. Therefore the executor cannot label the plan in the
same way as before (i.e., copy(TMAboot, /sbin/rl1, /sbin/rl3)). What it
does, after checking the domain of D1 in the world, is to choose one of the
domain values which are actually left (i.e., '/sbin/rl2'). The execution
proceeds as in the first case, with the only difference that the file is copied
from a different source ('/sbin/rl2'). See Figure 3.

4  Non-monotonic  Changes 

Up to now, we have considered the case in which values acquired during plan
construction may no longer be available during plan execution. However, a more
complex situation occurs when some new values, which were not retrieved during
plan construction, become available during plan execution. Standard CSPs do not
deal with value insertion in variable domains, since it implies reconsidering
previously deleted values which can be supported by the newly inserted value.
The ICSP framework can cope with non-monotonic changes of variable domains
thanks to the open domain data structure.

An open domain is represented by a set of known values and a variable
representing the unknown domain parts, i.e., potential future acquisitions.
Thus, if, during plan execution, the entire set of known values (those acquired
during plan construction) is deleted because of precondition verification, a new
acquisition can start, aimed at retrieving new consistent values. If no values
are available, backtracking is performed in order to explore the execution of
other branches in the search space of partial plans. If the overall process
fails (and only in this case), re-planning is performed.

Dynamic Constraint Satisfaction (DCS) [13] has been proposed in order to
deal with non-monotonic changes. DCS solvers maintain proper data structures
so as to tackle modifications of the constraint store. Thanks to the ICSP
framework we do not need to store additional information for restoring the
consistency of the constraint store, as done by DCS approaches. On the other
hand, our method makes the propagation we perform less powerful than that
performed by dynamic approaches. In fact, if we consider a constraint between
variables X and Y, the variable inserted in the domain of variable X represents
a potential support for values in the domain of variable Y, which cannot be
pruned until the domain of X becomes closed.

Example. Given the example above, let us consider a third case. If the domain
initially retrieved for D1 is completely wiped and an instance of file TMAboot
is put in another directory, say '/sbin/rl0', it is necessary to perform a
further acquisition via the undefined part of the D1 domain. In particular, the
constraints active on D1 can derive the correct domain for it from the new world
setting at execution time, and let the plan once more be executed successfully.
Figure 4 shows the output generated by the execution of action copy.


now checking preconditions...
-> condition inDirectory(TMAboot, /sbin/rl0) succeeded
...preconditions checked.
now doing labelling on preconditions...
labelling 2
labelling copy(TMAboot, /sbin/rl0, X)
...labelling on preconditions done.
now executing action 2: copy(TMAboot, /sbin/rl0, X)...
-> action copy(TMAboot, /sbin/rl0, /sbin/rl3) succeeded


Fig. 4. Output messages generated during the execution of action copy: case 3.
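
The re-acquisition step of this third case can be sketched as follows. The function refresh_domain and its parameters are hypothetical names introduced for illustration, and the world is modelled as a plain dictionary; the sketch only shows how an open domain differs from a closed one once all planning-time values have been wiped.

```python
from typing import Callable, Iterable, Set


def refresh_domain(known: Set[str], has_undef: bool,
                   still_valid: Callable[[str], bool],
                   acquire: Callable[[], Iterable[str]]) -> Set[str]:
    """Return the values usable at execution time for one variable.

    known holds the values acquired at planning time, has_undef tells whether
    the domain is still open, still_valid checks the precondition in the
    current world, and acquire senses the current world for new values.
    """
    live = {v for v in known if still_valid(v)}
    if live:
        return live
    if has_undef:
        # All planning-time values are gone: sense again through the open part.
        return {v for v in acquire() if still_valid(v)}
    # Closed domain and nothing left: the caller has to backtrack.
    return set()
```

With known = {'/sbin/rl1', '/sbin/rl2'} and a world in which TMAboot now lives only in '/sbin/rl0', an open domain yields {'/sbin/rl0'}, while the same call on a closed domain yields the empty set.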


5  Conclusion 

This paper describes an approach to deferred planning, which is one of the
main strategies for planning in dynamic environments. The idea is to delay some
planning decisions regarding sensory data as much as possible, in order to
reduce the gap between the world as it is observed at planning time and the
world the executor acts on. We exploit the Interactive Constraint Satisfaction
framework [12], an extension of the CS framework based on Interactive
Constraints, in order to interact with the real world. Sensing is performed both
at planning and at execution time.

The implementation of this architecture has been carried out by using the
finite domain library of ECLiPSe [3], properly extended to cope with the
interactive framework. ECLiPSe is a Constraint Logic Programming (CLP) [7]
system merging the features and advantages of Logic Programming and Constraint
Satisfaction techniques. CLP on Finite Domains, CLP(FD), can be used
to represent planning problems as CSPs.

A repair mechanism is currently under development in order to cope with
failures and backtracking steps over already executed actions. The repair
mechanism supports all cases in which the executor realises that the effects of
the action are not those expected.


Abstract. As human society grows large and complex, the problems which
human beings must solve are also becoming large and complex. In many
cases, a problem must be solved cooperatively by many people. There arises
the problem of decomposing the problem into sub-problems, distributing
these sub-problems to a number of persons, and organizing these people in
such a way that the problem can be solved most efficiently. This
organization is not universal but is made specific to the given problem. It is
possible to create a multi-agent system that corresponds to the cooperative
work of persons. The problem is then to create dynamically an organization
of agents that is suited to coping with the specific problem.

It is the major objective of this paper to discuss a way of generating such a
multi-agent system, with examples.

1  Introduction 

As social systems grow large and complex, the problems which human beings must
solve are also becoming large and complex. In many cases, a problem must be
solved cooperatively by many people. There arises the problem of decomposing
the problem into sub-problems, distributing these sub-problems to a number of
persons, and organizing these people in such a way that the problem can be solved
most efficiently. Managing this process has been one of the most important human
tasks. But the growth in the scale of problems increases the complexity of the
management tasks, and there is concern that the current method of managing the
process cannot keep up with this growth. A new method of management is required
today to resolve this problem.

One of the reasons why the current method is inadequate is that a large
number of decisions are distributed to the persons who join the process and are
made there without being fully recorded. A large part of these decisions remains
in the workers' minds, and the manager cannot follow the development process
afterward in order to check them. This problem remains unresolved as long as the
development is carried out mainly by persons, because it is caused by an
intrinsic characteristic of human beings. An alternative way to improve this
situation is to introduce computers into problem solving much more than before
and let them record the history of the process, especially the history of the
decisions made by persons in this process.

We introduce computers as software agents. It follows that agents replace
persons in an organization for problem solving.

Many papers discuss problem solving systems based on software agents [1][2][8].
In this paper, the Multi-strata model [5] is used to describe the problem solving
process, and multi-agent systems are created based on this model.

Several problems arise. How is the organization of the multi-agent system
generated and managed? How are human decisions recorded? How are past
records found and used for solving new problems? And so on.

Some of these issues, especially a way of generating a multi-agent system,
are discussed in this paper. Every agent in this system is intelligent in the
sense that it can not only solve problems autonomously based on a knowledge base
but also generate other intelligent agents as needed.

2  Problem  Solving 

2.1  Problem  Solving  Scheme 

Problems can be divided roughly into two types: design type and analysis type.
A design type problem is defined as one of obtaining a structure of an object
with the required function, while an analysis type problem is one of obtaining
the functionality of an object based on a given structure. As will be described
in the following, design type problem solving is defined as a repetitive
operation including analysis type problems. In this paper, therefore, the design
type problem is mainly discussed.

A basic operation for design type problem solving is represented in this paper
as roughly composed of the following three stages, and is shown in Fig. 1.


Fig.  1.  Standard  process  of  design  problem  solution 

1. Make an incipient model as an embodiment of the person's idea on problem
solving. It includes this person's requirements to be satisfied.

2. Analyze the model to obtain its functionality and behavior, and evaluate
whether it satisfies the person's requests.

3. Based on the result of the analysis and evaluation, modify the model.

If the request is satisfied, stop the design process. The model represents a
solution.
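
The three stages form a loop that stops when the request is satisfied. A minimal sketch, assuming the model, the analysis and the modification rule are supplied by the caller, could look like this; the function design and its parameters are hypothetical names, and the round limit is an added safeguard not mentioned in the text.

```python
from typing import Callable, Optional, TypeVar

Model = TypeVar("Model")
Result = TypeVar("Result")


def design(initial_model: Model,
           analyze: Callable[[Model], Result],
           satisfies: Callable[[Result], bool],
           modify: Callable[[Model, Result], Model],
           max_rounds: int = 100) -> Optional[Model]:
    """Repeat analysis and modification until the request is satisfied."""
    model = initial_model                  # stage 1: incipient model
    for _ in range(max_rounds):
        result = analyze(model)            # stage 2: analysis
        if satisfies(result):              #          and evaluation
            return model                   # the model represents a solution
        model = modify(model, result)      # stage 3: modification
    return None                            # no solution within the round limit
```

As a toy usage, take a model that is a single thickness value, an analysis that returns ten times the thickness as load capacity, a request of at least 45, and a modification that adds one unit of thickness: starting from 1, the loop stops at thickness 5.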

2.2 Solving Large Problem by Persons

When a problem is very large and is solved by persons, this basic process cannot
be applied directly; the problem must be decomposed into a set of smaller
sub-problems, and these sub-problems are distributed to different persons. Since
these sub-problems are not specified in advance but generated by decomposition
after the original problem is given, the persons cannot be assigned in advance
but must be assigned dynamically, in parallel with the decomposition process.
This method is illustrated with the case of aircraft design, shown in Fig. 2.

Fig. 2. Design process of an aircraft


The process is outlined as follows.

1. Model generation

Given the requirements from a client, a chief designer prepares an incipient
general model of the whole airplane, based on his/her experience and
referring to the case base. First, he/she creates the top node of a hierarchy to
represent the object and gives it the design requirements.

2. Model decomposition

Design starts top-down. In general, a complex object is decomposed into a
number of assemblies, each assembly is further decomposed into a set of
sub-assemblies, and so on. In this way a hierarchical object is generated. It is
required that the functionality of the designed object meets the given functional
requirement. The functionality of an object is determined by the functionality
of its components and their structure. In this sense a model representing an
object, called an object model hereafter, is characterized by bottom-up
dependency. But it is difficult to design a large object bottom-up because
of the combinatorial explosion of computations. Instead, a top-down design
must actually be performed.

The designer responsible for an object tentatively decides the upper part of
its structure, i.e. the main components and their structure. For example, an
aircraft designer first tentatively decides the main components (assemblies)
such as engine, main wing, fuselage, ladder wing, tail wing, vertical wing,
landing gear, wire-harness, electronic system, etc. Then their structural
relation is defined. By assuming the functionality of these components, the
functionality of the object can be estimated. If this tentative structure does
not satisfy the given requirement, the designer has to find another structure
or change the functionality of the components. The assumed functionality of
each component becomes the requirement for that component's design in the
following process.
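
The top-down decomposition described above can be pictured as building a tree. The following is only a minimal sketch: the class name `ObjectNode`, its fields, and the requirement strings are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """A node of the hierarchical object model (illustrative names only)."""
    name: str
    requirement: str                      # functionality assumed for this node
    children: list["ObjectNode"] = field(default_factory=list)

    def decompose(self, components: dict[str, str]) -> list["ObjectNode"]:
        """Top-down step: create one child per component. The functionality
        assumed for a component becomes the requirement of its own design."""
        self.children = [ObjectNode(n, req) for n, req in components.items()]
        return self.children

    def leaves(self) -> list[str]:
        """Names of the components that are not decomposed any further."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


# Placeholder requirements; the paper gives no concrete values.
aircraft = ObjectNode("aircraft", "client requirement")
engine, wing, fuselage = aircraft.decompose(
    {"engine": "assumed thrust", "main wing": "assumed lift", "fuselage": "assumed payload"}
)
engine.decompose({"compressor": "assumed pressure ratio", "turbine": "assumed power"})
print(aircraft.leaves())
```

Each call to `decompose` corresponds to one designer fixing a tentative structure; the leaves are the components handed to component experts.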

3.  Model  assignment  to  component-design 

If the designer is satisfied with the estimate, he/she tentatively fixes this
part of the design and distributes the problem of designing each component to
an expert in component design. For example, the engine design is assigned to an
engine expert. After that, a similar process is performed for each component by
the expert assigned in this way. Thus many people take part in the design of a
common object. Since the behaviors of these components are closely related to
each other, the design of one component cannot proceed independently of the
others but needs very close interaction. Usually, therefore, these people are
organized to assure easy communication and cooperation. That many people join
the same design means that decisions are distributed among different persons
and remain there without being recorded. This makes it difficult to trace the
design afterward for checking and maintenance. It also causes considerable
trouble for document acquisition in modeling if the previous record is
imperfect.

3  Outline  of  Autonomous  Problem  Solving 

3.1  Problem  Solving  System  Architecture 

This human-centered process is replaced by a computer-centered process, in
which a computer system manages the total process and persons join the problem
solving in parts. In this computer system an object problem is represented in a
knowledge representation language, and knowledge processing agents deal with
the problem cooperatively. The agents are organized into a multi-agent system
in which distributed agents are related to each other in the way best suited to
solving the given specific problem. The structure of the agents corresponds to
the human organization in the human problem solving discussed in section 2. The
major parts of the system are a global knowledge base, a distributed problem
solving system composed of plural agents, and a user interface. The global
knowledge base supplies every agent with the knowledge necessary for problem
solving. The overall structure of agents for problem solving is shown in Fig. 3.

3.2  KAUS  as  Knowledge  Representation  Language 

A language suited for representing this system is necessary. In order to cope
with a problem model of the form discussed in section 2, it must be suited for
representing predicates that include data structures as arguments and also for
describing meta-level operations, such as knowledge describing other knowledge.
KAUS (Knowledge Acquisition and Utilization Language) has been developed for
this purpose [10]. In the following, some logical expressions appear as
knowledge. In order to keep the consistency and integrity of expressions
throughout the whole system, these must be written in the KAUS language.
However, the expressions shown here are not necessarily correct KAUS
expressions but are locally simplified. This is because the KAUS syntax is not
included in this volume and the locally simplified expressions are more
comprehensible than the correct expressions.

Fig. 3. Agent architecture of problem solving


4  Multi-agent  Problem  Solving  System 

4.1  Design  Principle  of  an  Agent 


Fig.  4.  Structure  of  an  Agent 


An autonomous problem solving system is designed as a multi-agent system. An
agent in this multi-agent system is designed not as a special-purpose agent
that achieves a special role but as a general-purpose problem solving system
that can accept and cope with any problem as long as it is represented in the
modeling scheme developed for the system. No agent has any object knowledge
(knowledge related to a specific object) beforehand. It retrieves the necessary
knowledge from the global knowledge base when a problem is assigned, just
before it starts problem solving. An agent has three layers: one for actually
solving the given problem, a second for generating the problem solving system
by retrieving the necessary knowledge from the global knowledge base, and a
third for generating other agents (Fig. 4). Throughout this paper, an agent is
considered to be the subject of an activity. Therefore, if some activity is
defined in a computer system, there must be some agent that behaves as the
subject of the activity.

4.2 Behavior of Agents

The behavior of agents is similar to that of persons working together, as
discussed in section 2. Design-type problems are kept in mind. When a
problem-solving task starts, a highest-level agent is prepared that corresponds
to a human chief designer. On receiving the user's requirement, it tries to
decompose the object top-down as a first step in generating an object
structure. It analyzes and evaluates the structure, and if it decides that it
has succeeded in making the object structure at the highest level, it generates
and assigns an agent to every component of this structure. In this way the
problem solving proceeds top-down. When it reaches the components at the
bottom, that is, the components that need no further decomposition, the process
stops in success.


(| (designA A) ~(designX X) ~(designY Y)
   ~(mergeModel A X Y)).
(designX 10).
(designY 5).
(| (mergeModel A X Y) ~($add A X Y)).


Fig.  5.  Knowledge  of  designA 

The behavior of an agent in this process is explained first using the simplest
example, shown in Fig. 5. This is the problem of designing an imaginary object
A, which merges the results of two designs, for X and Y. In reality these
designs would obtain the entities X and Y; here their results are given as the
numbers 10 and 5 respectively, and the merge operation adds the numbers. A
design problem starts when the requirement ‘(designA A)?’ is given. First an
agent at the highest level is created and given the requirement. The agent
tries to satisfy the requirement. It refers to the knowledge base and finds
knowledge which says that the requirement can be satisfied by achieving the two
designs X and Y and merging them. This means that three new activities are
defined and, accordingly, that there should be three subjects for the
activities. Traditionally these activities are combined into one program
because each new activity is very simple. This is the case in which the same
single computer represents the three subjects. But there can be different ways
when the sub-problems are large. For example, the subjects can all be different
from each other, with an agent created to represent each subject. For the
purpose of explanation, the design process is presented below assuming this
last case. In the following explanation, the sentences headed by * show how the
general method applies to this specific example.

1. An agent receives the problem from the user interface. The agent becomes the
highest-level agent. The agent analyzes the problem, takes related knowledge
from the knowledge base, and saves it in the local knowledge base within the
agent.

* An agent receives ‘(designA A)?’. The Agent Controller in the agent retrieves
related knowledge from the Global Knowledge Base and records it in the Local
Knowledge Base. In this case it is ‘(| (designA A) ~(designX X) ~(designY Y)
~(mergeModel A X Y))’.

2. The agent selects knowledge to use for problem solving from the local
knowledge base and analyzes its structure to decide whether the problem should
be decomposed or not. If the object is to be decomposed, an agent is created
for every component, and a new requirement is given to every new agent.

If there is no suitable knowledge in the local knowledge base, the agent
requests other knowledge via the user interface.

* The inference engine selects the knowledge ‘(| (designA A) ~(designX X)
~(designY Y) ~(mergeModel A X Y))’ in the knowledge base. The object model A is
decomposed into X and Y. Agents X and Y are created corresponding to the
objects X and Y respectively.

3. The agent assigns each sub-problem to a lower agent. After that, the lower
agents are activated.

* The agent assigns the requirements ‘(designX X)?’ and ‘(designY Y)?’ to the
agents X and Y respectively.

4. The higher-level agent receives the solutions from the lower-level agents
and continues problem solving using the results. If a solution is not obtained,
the problem solving is carried out again using different knowledge.

* The higher-level agent receives X=10 and Y=5 from the lower-level agents.
‘(mergeModel A 10 5)?’ is solved there to obtain A=15.

5. The solution is returned to the user interface. The inference engine
replaces the variable part of the knowledge with the solutions of the
sub-problems and the problem, and the Agent Controller registers it in the Case
Base.

* ‘A=15’ is returned to the user interface.
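
The five steps can be traced with a small Python sketch of the designA example. It is an illustration under stated assumptions, not the KAUS implementation: the dictionaries `GLOBAL_RULES`, `GLOBAL_FACTS` and `CASE_BASE` and the class `Agent` are hypothetical stand-ins for the Global Knowledge Base, the Case Base and an agent, and the rule of Fig. 5 is encoded as a Python function instead of a KAUS expression.

```python
# Hypothetical Python encoding of the knowledge in Fig. 5 (not KAUS syntax):
# a rule maps a goal to (sub-goals, merge function); a fact maps a goal to a value.
GLOBAL_RULES = {"designA": (["designX", "designY"], lambda x, y: x + y)}
GLOBAL_FACTS = {"designX": 10, "designY": 5}
CASE_BASE = {}


class Agent:
    def __init__(self, goal):
        self.goal = goal
        # Step 1: copy the knowledge related to the goal into a local knowledge base.
        self.local_rules = {g: r for g, r in GLOBAL_RULES.items() if g == goal}
        self.local_facts = {g: v for g, v in GLOBAL_FACTS.items() if g == goal}

    def solve(self):
        if self.goal in self.local_facts:            # bottom component: no decomposition
            return self.local_facts[self.goal]
        if self.goal not in self.local_rules:        # no suitable local knowledge
            raise LookupError(f"no knowledge for {self.goal}")
        subgoals, merge = self.local_rules[self.goal]
        lower_agents = [Agent(g) for g in subgoals]  # step 2: one agent per component
        results = [a.solve() for a in lower_agents]  # steps 3 and 4: assign, collect
        solution = merge(*results)                   # step 4: merge the results
        # Step 5: register the instantiated knowledge as a case.
        CASE_BASE[self.goal] = (tuple(subgoals), tuple(results), solution)
        return solution


print(Agent("designA").solve())
```

In this sketch the lower agents run sequentially inside one process; in the system described here they would be separate agents, possibly on different computers.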

Interactions between agents are very important, but they constitute a very
large problem. Our group is still studying agent interactions; the present
system has not yet realized interaction between agents that have the same
higher agent.

5  Knowledge  Base 

The knowledge base is required to retrieve appropriate knowledge for the
requests from agents in a short time, to assure the practicality of the system.
Since a large amount of knowledge of various types and domains is stored, the
knowledge base must be well managed by a knowledge management system. The
knowledge is divided into chunks by type information, domain information and
other information to aid rapid retrieval of knowledge. These chunks are
structured in the large knowledge base. The large knowledge base management
system is itself a special agent. It accepts requests from the other agents,
retrieves the required knowledge and sends it back to the requesting agent.
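
As a rough illustration of chunked retrieval, the sketch below indexes knowledge items by a (type, domain) key. The class name `KnowledgeBaseAgent`, its methods, and the stored strings are assumptions made for this example; the paper does not describe the actual index structure.

```python
from collections import defaultdict


class KnowledgeBaseAgent:
    """Hypothetical knowledge-base agent that stores knowledge in chunks."""

    def __init__(self):
        self._chunks = defaultdict(list)   # (type, domain) -> knowledge items

    def store(self, ktype, domain, item):
        self._chunks[(ktype, domain)].append(item)

    def request(self, ktype, domain):
        """Return a copy of the matching chunk, or an empty list if none exists."""
        return list(self._chunks.get((ktype, domain), []))


kb = KnowledgeBaseAgent()
kb.store("rule", "house", "design-privateRoom <- wall, pdoor, pwindow, pfloor, pceiling")
kb.store("fact", "house", "(pfloor tatami)")
kb.store("fact", "aircraft", "(engine turbofan)")   # invented entry for illustration
print(kb.request("fact", "house"))
```

Looking up a chunk by key avoids scanning unrelated knowledge, which is the point of dividing the knowledge base by type and domain.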

6 Overall System Architecture

An  overall  system  architecture  is  shown  in  Fig.  6. 

Fig. 6. System architecture


The system consists of many computers on a network. This group of computer
systems is organized dynamically into a distributed multi-agent system specific
to a given problem. In principle, all computers are the same and symmetric to
each other, except for the large knowledge-base management system, which is
specifically designed to achieve its specific role. CORBA is introduced so that
agents can cooperate with each other. Since every agent works on the KAUS
system, a CORBA extension named KAUS-CORBA was developed.

7  Experiment 

A slightly more complex problem than that used for explanation in section 4 has
been solved as an example. This is to design a private house [3]. In this
example, a house M is composed of such components as movementLine M0, equipment
M1, livingSpace M2, and privateRoom M3. A part of the knowledge, corresponding
to ‘(| (designA A) ~(designX X) ~(designY Y) ~(mergeModel A X Y))’ in the
previous example, is given as follows.

(| (design-house M) ~(design-movementLine M0) ~(design-equipment M1)
   ~(design-livingSpace M2) ~(design-privateRoom M3)
   ~(merge M [M0 M1 M2 M3])).

(| (design-privateRoom Y) ~(wall Y0) ~(pdoor Y1) ~(pwindow Y2) ~(pfloor Y3)
   ~(pceiling Y4) ~(merge Y [Y0 Y1 Y2 Y3 Y4])).

(pfloor wooden). (pfloor plastic). (pfloor tatami). (pfloor carpet). (pfloor stone).


Fig.  7.  A  part  of  knowledge  to  design  a  house 

This system was applied to the design of the house. The arrangement of rooms is
passed to the top agent as the requirement, and the agents select all the parts
of the house. The problem of designing the house was divided into the
sub-problems of designing its parts; different agents were assigned to the
sub-problems, and the designed parts were merged to obtain the whole house.
Every sub-problem was solved by a different agent.

The knowledge base includes various alternative rules, and it was confirmed
that, depending on the problem, the way of decomposing the problem changed, a
different organization of agents was generated, and the results of past trials
were used effectively.
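
How alternative knowledge leads to retrying a design can be sketched for the private room. Only the five `pfloor` alternatives are taken from Fig. 7; the other component values, the function `design_private_room`, and the acceptance test are placeholders invented for this sketch.

```python
from itertools import product

PARTS = ["wall", "pdoor", "pwindow", "pfloor", "pceiling"]
ALTERNATIVES = {
    "pfloor": ["wooden", "plastic", "tatami", "carpet", "stone"],  # facts listed in Fig. 7
    # Placeholder values invented for this sketch:
    "wall": ["wall-0"],
    "pdoor": ["door-0"],
    "pwindow": ["window-0"],
    "pceiling": ["ceiling-0"],
}


def design_private_room(acceptable):
    """Try alternative component designs in order until the merged room is accepted."""
    for choice in product(*(ALTERNATIVES[p] for p in PARTS)):
        room = dict(zip(PARTS, choice))   # plays the role of (merge Y [Y0 ... Y4])
        if acceptable(room):
            return room
    return None                           # no combination satisfies the requirement


# Hypothetical requirement: the floor must be tatami or carpet.
room = design_private_room(lambda r: r["pfloor"] in ("tatami", "carpet"))
print(room)
```

The loop corresponds to carrying out problem solving again with different knowledge when a candidate does not satisfy the requirement.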

8  Conclusion 

This paper discussed a way of solving large problems in a distributed
multi-agent system. A problem is decomposed into sub-problems and, depending on
this decomposition, agents are generated. These agents maintain relations with
each other for cooperation according to the relations between the sub-problems,
and therefore a multi-agent system is formed that is tailored to the specific
problem. An agent is intelligent in the sense that it can solve various types
of problems autonomously and can also create other agents as needed. The paper
presented the basic idea, a way of problem solving, a way of generating a
multi-agent system, and two experiments using very simple examples. This system
is part of a larger system the author’s group is now developing. That is a very
large system and many problems remain to be solved. The part discussed in this
paper is the central portion of the ideas behind the development of this
system.

Abstract. There is current interest in generalizing Bayesian networks
by using dependencies which are more general than probabilistic conditional
independence (CI). Contextual dependencies, such as context-specific
independence (CSI), are used to decompose a subset of the joint distribution.
We have introduced a more general contextual dependency than CSI, as well as a
more general noncontextual dependency than CI. We developed these probabilistic
dependencies based upon a new method of expressing database dependencies. By
defining database dependencies using equivalence relations, the difference
between the various contextual and noncontextual dependencies can be easily
understood. Moreover, this new representation of dependencies provides a
convenient tool to readily derive other results.


1  Introduction 

Bayesian networks [5] have become an established framework for uncertainty
management in artificial intelligence. Bayesian networks only use a single type
of dependency, called probabilistic conditional independence (CI), to
losslessly decompose a joint probability distribution. There is current
interest, however, in generalizing Bayesian networks with more general
dependencies. In [1], a contextual (horizontal) dependency, called
context-specific independence (CSI), was introduced to capture CIs that only
hold in some of the tuples in a joint distribution. In [8], we introduced a
more general contextual dependency than CSI, as well as a more general
noncontextual dependency than CI. The important point, however, is that our
probabilistic dependencies were motivated by corresponding database
dependencies.

Weak multivalued dependency (WMVD) [2,3] is a more general database dependency
than multivalued dependency (MVD) [4]. Fischer and Van Gucht [2] gave several
characterizations of WMVD. In this paper, we suggest a new characterization of
both MVD and WMVD based on equivalence relations. In this framework, the
difference between the various database dependencies can be easily understood.
Moreover, this new representation of dependencies provides a convenient tool to
readily derive other results.

This paper is organized as follows. In Section 2, we review some pertinent
notions in the relational database model, and recall some notions about
equivalence relations. We use this framework to express contextual and
noncontextual dependencies in Section 3. In Section 4, we compare strong and
weak dependencies. In Section 5, we demonstrate the simplicity of our framework
by showing the soundness of some known inference axioms. The conclusion is
given in Section 6.


2  Basic  Notions 

2.1  Relational  Databases 

Here we review some notions used in the elegant relational database model [4].

A relation scheme $R = \{A_1, A_2, \ldots, A_m\}$ is a finite set of
attributes. Corresponding to each attribute $A_i$ is a nonempty finite set
$D_i$, $1 \le i \le m$, called the domain of $A_i$. Let
$D = D_1 \cup D_2 \cup \ldots \cup D_m$. A relation $r$ on the relation scheme
$R$, written $r(R)$, is a finite set of mappings $\{t_1, t_2, \ldots, t_s\}$
from $R$ to $D$ with the restriction that for each mapping $t \in r$, $t(A_i)$
must be in $D_i$, $1 \le i \le m$, where $t(A_i)$ denotes the value obtained by
restricting the mapping $t$ to $A_i$. The mappings are called tuples and $t(A)$
is called the A-value of $t$. We use $t(X)$ in the obvious way and call it the
X-value of $t$.

Mappings are used in our exposition to avoid any explicit ordering of the
attributes in the relation scheme. To simplify the notation, however, we will
henceforth denote relations by writing the attributes in a certain order and
the tuples as lists of values in the same order. Furthermore, the following
relational database conventions will be adopted for simplified notation.
Uppercase letters $A, B, C$ from the beginning of the alphabet may be used to
denote attributes. A relation scheme $R = \{A_1, A_2, \ldots, A_m\}$ may be
written simply as $A_1 A_2 \ldots A_m$. A relation $r$ on scheme $R$ is then
written as either $r(R)$ or $r(A_1 A_2 \ldots A_m)$. The singleton set $\{A\}$
is sometimes written as $A$, and concatenation $XY$ may be used to denote set
union $X \cup Y$.

The select $\sigma$, project $\pi$, and natural join $\bowtie$ operators are
defined as follows.

When the select operator $\sigma$ is applied to a relation $r$, it yields
another relation that is a subset of tuples of $r$ with a certain value on a
specified attribute. Let $r$ be a relation on scheme $R$, $A \in R$, and
$a \in D_A$. Then

$$\sigma_{A=a}(r) = \{\, t \mid t \in r \text{ and } t(A) = a \,\}.$$

Whereas the select operator chooses a subset of tuples in a relation, the
project operator $\pi$ chooses a subset of attributes. Let $r$ be a relation on
$R$ and $X$ a subset of $R$. The projection of $r$ onto $X$, written
$\pi_X(r)$, is defined as

$$\pi_X(r) = \{\, t(X) \mid t \in r \,\}. \qquad (1)$$

The natural join of two relations $r_1(X)$ and $r_2(Y)$, written
$r_1(X) \bowtie r_2(Y)$, is defined as

$$r_1(X) \bowtie r_2(Y) = \{\, t(XY) \mid t(X) \in r_1(X) \text{ and } t(Y) \in r_2(Y) \,\}. \qquad (2)$$
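
The three operators can be tried out directly. The sketch below represents a tuple as a frozenset of (attribute, value) pairs, i.e. as a mapping from attributes to values, and a relation as a Python set of such tuples. The helper names (`tup`, `select`, `project`, `natural_join`) and the sample relation are assumptions made for illustration.

```python
def tup(**values):
    """A tuple is a mapping from attributes to values; a frozenset of
    (attribute, value) pairs keeps it hashable so relations can be sets."""
    return frozenset(values.items())


def select(r, attr, value):
    """sigma_{attr=value}(r): the tuples of r whose attr-value equals value."""
    return {t for t in r if dict(t)[attr] == value}


def project(r, attrs):
    """pi_attrs(r): restrict every tuple to the attribute set attrs (Equation (1))."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}


def natural_join(r1, r2):
    """r1 join r2: combine tuples that agree on all shared attributes (Equation (2))."""
    joined = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                joined.add(t1 | t2)
    return joined


# Hypothetical relation r(ABC): every combination of A and C values with B = 0.
r = {tup(A=a, B=0, C=c) for a in (0, 1) for c in (0, 1)}
r_ab = project(r, {"A", "B"})
r_bc = project(r, {"B", "C"})
print(natural_join(r_ab, r_bc) == r)
```

Using sets gives the duplicate elimination that projection requires without extra code.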

A  fundamental  database  dependency,  namely,  multivalued  dependency 
(MVD),  can  now  be  defined. 

Definition 1. Let $X, Y, Z$ be pairwise disjoint subsets of scheme $R = XYZ$.
A relation $r(XYZ)$ satisfies the multivalued dependency MVD(Y,X,Z), if for any
two tuples $t_1$ and $t_2$ in $r$ with $t_1(X) = t_2(X)$, there exists a tuple
$t_3$ in $r$ with $t_3(XY) = t_1(XY)$ and $t_3(Z) = t_2(Z)$.

The multivalued dependency MVD(Y,X,Z) is a necessary and sufficient condition
for $r(XYZ)$ to be losslessly decomposed as

$$r(XYZ) = \pi_{XY}(r) \bowtie \pi_{XZ}(r). \qquad (3)$$
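
Definition 1 translates almost word for word into a check over pairs of tuples. In the sketch below a relation is a list of dicts; the function name `satisfies_mvd`, its argument order (chosen to mirror MVD(Y,X,Z)), and both sample relations are assumptions for illustration.

```python
def satisfies_mvd(r, Y, X, Z):
    """Definition 1: for all t1, t2 in r with t1(X) = t2(X) there is a t3 in r
    with t3(XY) = t1(XY) and t3(Z) = t2(Z). Y, X, Z are lists of attribute names."""
    def val(t, attrs):
        return tuple(t[a] for a in attrs)

    for t1 in r:
        for t2 in r:
            if val(t1, X) != val(t2, X):
                continue
            if not any(val(t3, X + Y) == val(t1, X + Y) and val(t3, Z) == val(t2, Z)
                       for t3 in r):
                return False
    return True


# Hypothetical relations on ABC (not the tables of Example 1).
good = [{"A": a, "B": 0, "C": c} for a in (0, 1) for c in (0, 1)]
bad = good[:-1]                      # drop the tuple (A=1, B=0, C=1)
print(satisfies_mvd(good, ["A"], ["B"], ["C"]), satisfies_mvd(bad, ["A"], ["B"], ["C"]))
```

In `bad`, the tuples (1,0,0) and (0,0,1) agree on B, but the required tuple (1,0,1) is missing, so the dependency fails.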


Example 1. The following relation $r_1(ABC)$ on the left satisfies the
multivalued dependency MVD(A,B,C), since

$$r_1(ABC) = \pi_{AB}(r_1) \bowtie \pi_{BC}(r_1).$$

However, $r_2(ABC)$ on the right does not satisfy MVD(A,B,C), since

$$r_2(ABC) \neq \pi_{AB}(r_2) \bowtie \pi_{BC}(r_2).$$

[Tables of $r_1(ABC)$ and $r_2(ABC)$ not recovered in this copy.]

2.2  Properties  of  Equivalence  Relations 

Since we suggest that dependencies can be conveniently expressed using
equivalence relations, we first recall some familiar notions about relations
[6].

Given any subset $X \subseteq R$, we can define an equivalence relation
$\theta(X)$ on $r$ (a partition of $r$): for all $t_i, t_j \in r$,

$$t_i \,\theta(X)\, t_j, \text{ if } t_i(X) = t_j(X). \qquad (4)$$

The composition operator $\circ$ is used to combine relations. Let
$T = \{t_1, t_2, \ldots, t_s\}$ denote a finite set of objects. Consider two
relations $\theta_1$ and $\theta_2$ on $T$. The binary operator $\circ$, called
the composition, is defined by: for $t_i, t_k \in T$,

$$t_i (\theta_1 \circ \theta_2) t_k, \text{ if for some } t_j \in T \text{ both } t_i \theta_1 t_j \text{ and } t_j \theta_2 t_k. \qquad (5)$$
It can be shown that the composition $\theta_1 \circ \theta_2$ of two
individual equivalence relations $\theta_1$ and $\theta_2$ is itself an
equivalence relation (a partition) if and only if
$\theta_1 \circ \theta_2 = \theta_2 \circ \theta_1$.

We can then define MVD using equivalence relations as follows:

Definition 2. Relation $r(XYZ)$ satisfies MVD(Y,X,Z), if

$$\theta(X) = \theta(XY) \circ \theta(XZ) = \theta(XZ) \circ \theta(XY). \qquad (6)$$
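
Equations (4) to (6) can be checked mechanically by storing an equivalence relation as a set of index pairs. The following sketch assumes single-letter attribute names so that a string such as "BA" stands for the attribute set {B, A}; the function names and sample relations are illustrative, not from the paper.

```python
def theta(r, attrs):
    """Equation (4): pairs (i, j) of tuple indices whose tuples agree on attrs."""
    return {(i, j) for i, ti in enumerate(r) for j, tj in enumerate(r)
            if all(ti[a] == tj[a] for a in attrs)}


def compose(rel1, rel2):
    """Equation (5): i (rel1 o rel2) k iff some j has i rel1 j and j rel2 k."""
    return {(i, k) for (i, j) in rel1 for (j2, k) in rel2 if j == j2}


def satisfies_mvd(r, Y, X, Z):
    """Definition 2: theta(X) = theta(XY) o theta(XZ) = theta(XZ) o theta(XY)."""
    xy, xz = theta(r, X + Y), theta(r, X + Z)
    return theta(r, X) == compose(xy, xz) == compose(xz, xy)


# Hypothetical relations on ABC (not the tables of Example 1).
good = [{"A": a, "B": 0, "C": c} for a in (0, 1) for c in (0, 1)]
bad = good[:-1]                      # drop the tuple (A=1, B=0, C=1)
print(satisfies_mvd(good, "A", "B", "C"), satisfies_mvd(bad, "A", "B", "C"))
```

On these two relations the result agrees with a direct check of Definition 1: `good` satisfies MVD(A,B,C) and `bad` does not.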


3  Generalizing  Multivalued  Dependency 

In this section, we generalize MVD with both contextual and noncontextual
dependencies. Contextual dependencies only decompose a subset of the relation,
while noncontextual dependencies decompose the entire relation.

3.1  Context  Strong  Multivalued  Dependency  (CSMVD) 

Sometimes  only  a  few  tuples  in  a  relation  cause  the  violation  of  an  MVD.  In  this 
section,  we  introduce  context  strong  multivalued  dependency  (CSMVD)  in  order 
to  losslessly  decompose  part  (a  subset)  of  a  relation. 

Consider the relation $r_1(ABC)$ in Figure 1. It can be verified that
$r_1(ABC)$ does not satisfy MVD(A,B,C). The reason is that the definition of
MVD requires that MVD(Y,X=x,Z) holds for all X-values $x$ in relation
$r(XYZ)$. In this example, this means that MVD(A,B=0,C) and MVD(A,B=1,C) must
both hold. However, it can be seen that MVD(A,B,C) holds when B=0, but not when
B=1. The important point is that even though the entire relation $r_1(ABC)$
cannot be losslessly decomposed using MVD, namely,

$$r_1(ABC) \neq \pi_{AB}(r_1) \bowtie \pi_{BC}(r_1),$$

it is still possible to losslessly decompose the tuples
$\sigma_{B=0}(r_1) = \{t_1, t_2, t_3, t_4\}$:

$$\sigma_{B=0}(r_1) = \pi_{AB}(\sigma_{B=0}(r_1)) \bowtie \pi_{BC}(\sigma_{B=0}(r_1)).$$


Definition 3. Relation $r(XYZ)$ satisfies the context strong multivalued
dependency CSMVD(Y,X=x,Z), if the equivalence class defined by $X = x$ in the
equivalence relation $\theta(X)$ satisfies the following condition:

$$\theta(X = x) = \theta(X = xY) \circ \theta(X = xZ) = \theta(X = xZ) \circ \theta(X = xY). \qquad (7)$$

Fig. 1. Relation $r_1(ABC)$ satisfies CSMVD(A,B=0,C). Relation $r_2(ABC)$
satisfies WMVD(A,B,C). Relation $r_3(ABC)$ satisfies CWMVD(A,B=0,C).


Example 2. Context Strong Multivalued Dependency. Let us verify that relation
$r_1(ABC)$ in Figure 1 satisfies CSMVD(A,B=0,C). By Equation (4), we first
obtain

$$\theta(B = 0) = \{[t_1, t_2, t_3, t_4]\}.$$

By another application of Equation (4), we obtain the equivalence relations:

$$\theta(AB = 0) = \{[t_1, t_2], [t_3, t_4]\},$$

and

$$\theta(B = 0C) = \{[t_1, t_3], [t_2, t_4]\}.$$

Applying Equation (5) gives:

$$\theta(AB = 0) \circ \theta(B = 0C) = \{[t_1, t_2, t_3, t_4]\} = \theta(B = 0C) \circ \theta(AB = 0).$$

We have our desired result since:

$$\theta(B = 0) = \theta(AB = 0) \circ \theta(B = 0C) = \theta(B = 0C) \circ \theta(AB = 0).$$

However,  it  can  be  similarly  verified  that  CSMVD(A,B=1,C)  is  not  satisfied. 
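
The contextual check can be sketched by first restricting the relation to the tuples with $X = x$ and then testing the condition of Equation (7) on that equivalence class. The relation below is hypothetical: it is built to behave like the one described in the example (decomposable for B=0, not for B=1) and is not the table of Figure 1. Attribute names are assumed to be single letters.

```python
def theta(r, attrs):
    """Pairs (i, j) of tuple indices whose tuples agree on every attribute in attrs."""
    return {(i, j) for i, ti in enumerate(r) for j, tj in enumerate(r)
            if all(ti[a] == tj[a] for a in attrs)}


def compose(rel1, rel2):
    """i (rel1 o rel2) k iff some j has i rel1 j and j rel2 k."""
    return {(i, k) for (i, j) in rel1 for (j2, k) in rel2 if j == j2}


def satisfies_csmvd(r, Y, X, x, Z):
    """Equation (7) restricted to the tuples of r whose X-value equals x."""
    context = [t for t in r if tuple(t[a] for a in X) == x]
    xy, xz = theta(context, X + Y), theta(context, X + Z)
    return theta(context, X) == compose(xy, xz) == compose(xz, xy)


# Hypothetical relation: B=0 holds all four A,C combinations, B=1 only two.
r = [{"A": a, "B": 0, "C": c} for a in (0, 1) for c in (0, 1)]
r += [{"A": 0, "B": 1, "C": 0}, {"A": 1, "B": 1, "C": 1}]
print(satisfies_csmvd(r, "A", "B", (0,), "C"), satisfies_csmvd(r, "A", "B", (1,), "C"))
```

For the B=1 context the composition only relates each tuple to itself, so it differs from the single class that $\theta(B = 1)$ defines.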

CSMVD generalizes MVD by only decomposing some of the tuples in a relation.
However, it is also possible to generalize MVD with a noncontextual dependency,
called weak multivalued dependency (WMVD), which decomposes all of the tuples
in a relation.

3.2  Weak  Multivalued  Dependency  (WMVD) 

Weak multivalued dependency (WMVD) [3] generalizes MVD(Y,X,Z) in Definition 2
by not requiring the equivalence relation
$\theta(XY) \circ \theta(XZ) = \theta(XZ) \circ \theta(XY)$ to be equal to
$\theta(X)$.

Definition 4. Relation $r(XYZ)$ satisfies the weak multivalued dependency
WMVD(Y,X,Z), if

$$\theta(XY) \circ \theta(XZ) = \theta(XZ) \circ \theta(XY). \qquad (8)$$


Example 3. Weak Multivalued Dependency. Let us verify that relation $r_2(ABC)$
in Figure 1 satisfies WMVD(A,B,C). By Equation (4), we obtain:

$$\theta(AB) = \{[t_1, t_2], [t_3, t_4], [t_5], [t_6], [t_7], [t_8]\},$$

and

$$\theta(BC) = \{[t_1, t_3], [t_2, t_4], [t_5, t_6], [t_7, t_8]\}.$$

Applying Equation (5), we obtain our desired result since:

$$\theta(AB) \circ \theta(BC) = \{[t_1, t_2, t_3, t_4], [t_5, t_6], [t_7, t_8]\} = \theta(BC) \circ \theta(AB).$$


Thus,  even  though  a  relation  does  not  satisfy  MVD(Y,X,Z),  it  may  still  be 
possible  to  losslessly  decompose  the  entire  relation  using  WMVD(Y,X,Z). 
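
The composition in Example 3 can be recomputed from the two partitions alone, without the underlying table. The sketch below uses tuple indices 1 to 8 for $t_1, \ldots, t_8$; the helper names `pairs`, `compose` and `related_sets` are assumptions for illustration.

```python
def pairs(partition):
    """Turn a partition (list of blocks) into the set of related index pairs."""
    return {(i, k) for block in partition for i in block for k in block}


def compose(rel1, rel2):
    """Equation (5): i (rel1 o rel2) k iff some j has i rel1 j and j rel2 k."""
    return {(i, k) for (i, j) in rel1 for (j2, k) in rel2 if j == j2}


def related_sets(rel, universe):
    """For each element, the sorted tuple of elements it is related to; for an
    equivalence relation these are exactly its classes."""
    return sorted({tuple(sorted(k for (i, k) in rel if i == u)) for u in universe})


# Partitions of r2(ABC) as given in Example 3.
theta_AB = pairs([{1, 2}, {3, 4}, {5}, {6}, {7}, {8}])
theta_BC = pairs([{1, 3}, {2, 4}, {5, 6}, {7, 8}])

left = compose(theta_AB, theta_BC)
right = compose(theta_BC, theta_AB)
print(left == right, related_sets(left, range(1, 9)))
```

The two compositions coincide, which is the condition of Equation (8), and the resulting classes match those stated in the example.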

3.3  Context  Weak  Multivalued  Dependency  (CWMVD) 

We can introduce a contextual version of WMVD, called context weak multivalued
dependency (CWMVD).

Definition 5. Relation $r(XYZ)$ satisfies the context weak multivalued
dependency CWMVD(Y,X=x,Z), if there exists a maximal disjoint compatibility
class $\{t_i, \ldots, t_j\}$ in the relation
$\theta(X = xY) \circ \theta(X = xZ)$.

Definition 5 implies that $\{t_i, \ldots, t_j\}$ satisfies MVD(Y,X,Z).

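Example 4. Context Weak Multivalued Dependency. To verify that relation
$r_3(ABC)$ in Figure 1 satisfies CWMVD(A,B=0,C), we first obtain:

$$\theta(AB = 0) = \{[t_1, t_2], [t_3, t_4], [t_5, t_6], [t_7]\},$$

and

$$\theta(B = 0C) = \{[t_1, t_3], [t_2, t_4], [t_5, t_7], [t_6]\}.$$

Applying Equation (5), we obtain
$\mathcal{R} = \theta(AB = 0) \circ \theta(B = 0C)$:

$$\mathcal{R} = \{\, t_1\mathcal{R}t_1, t_1\mathcal{R}t_2, t_1\mathcal{R}t_3, t_1\mathcal{R}t_4, t_2\mathcal{R}t_1, t_2\mathcal{R}t_2, t_2\mathcal{R}t_3, t_2\mathcal{R}t_4,$$
$$t_3\mathcal{R}t_1, t_3\mathcal{R}t_2, t_3\mathcal{R}t_3, t_3\mathcal{R}t_4, t_4\mathcal{R}t_1, t_4\mathcal{R}t_2, t_4\mathcal{R}t_3, t_4\mathcal{R}t_4,$$
$$t_5\mathcal{R}t_5, t_5\mathcal{R}t_6, t_5\mathcal{R}t_7, t_6\mathcal{R}t_5, t_6\mathcal{R}t_6, t_6\mathcal{R}t_7, t_7\mathcal{R}t_5, t_7\mathcal{R}t_7 \,\}.$$

Note that $t_7\mathcal{R}t_6$ is not a member of $\mathcal{R}$. Thus,
$\{t_1, t_2, t_3, t_4\}$ is a maximal disjoint compatibility class, i.e.,
$\{t_1, t_2, t_3, t_4\}$ satisfies MVD(A,B,C). Therefore, relation $r_3(ABC)$
satisfies CWMVD(A,B=0,C).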
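
The relation $\mathcal{R}$ of Example 4 can likewise be recomputed from the two partitions stated there. In this sketch indices 1 to 7 stand for $t_1, \ldots, t_7$, and the helper names are assumptions for illustration. The check on the block {1, 2, 3, 4} only tests that all of its members are mutually related and related to nothing outside it, which is how the example uses the class.

```python
def pairs(partition):
    """Turn a partition (list of blocks) into the set of related index pairs."""
    return {(i, k) for block in partition for i in block for k in block}


def compose(rel1, rel2):
    """Equation (5): i (rel1 o rel2) k iff some j has i rel1 j and j rel2 k."""
    return {(i, k) for (i, j) in rel1 for (j2, k) in rel2 if j == j2}


# Partitions of the B=0 context of r3(ABC) as given in Example 4.
theta_AB0 = pairs([{1, 2}, {3, 4}, {5, 6}, {7}])
theta_B0C = pairs([{1, 3}, {2, 4}, {5, 7}, {6}])
R = compose(theta_AB0, theta_B0C)

block = {1, 2, 3, 4}
fully_related = all((i, k) in R for i in block for k in block)
isolated = all((i in block) == (k in block) for (i, k) in R)

print(len(R), (7, 6) in R, (6, 7) in R, fully_related, isolated)
```

Since (6, 7) is in R but (7, 6) is not, R is not symmetric on {5, 6, 7}, so the two compositions do not commute on the whole context, while the block {1, 2, 3, 4} behaves like an equivalence class.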

4 Comparing Strong Versus Weak Dependencies

Our purpose in this section is to show that weak dependencies are more general
than strong dependencies.

Lemma  1.  [2,3]  MVD  is  a  special  case  of  WMVD. 


Lemma  2.  CSMVD  is  a  special  case  of  CWMVD. 

Similarly,  contextual  dependencies  are  more  general  than  their  noncontextual 
counterparts. 

Lemma  3.  CSMVD  is  a  more  general  dependency  than  MVD. 


Lemma  4.  CWMVD  is  a  more  general  dependency  than  WMVD. 

The  relationships  between  all  of  these  dependencies  can  be  summarized  as: 

$$\text{MVD} \Rightarrow \text{WMVD} \Rightarrow \text{CWMVD},$$

and

$$\text{MVD} \Rightarrow \text{CSMVD} \Rightarrow \text{CWMVD}.$$

It should be noted that WMVD does not logically imply CSMVD, and vice versa.
For example, relation $r_2(ABC)$ in Figure 1 satisfies WMVD(A,B,C), but not
CSMVD(A,B=0,C). On the other hand, relation $r_1(ABC)$ in Figure 1 satisfies
CSMVD(A,B=0,C), but not WMVD(A,B,C).

5  Axiomatization  of  the  Noncontextual  Dependencies 

By  expressing  dependencies  using  equivalence  relations,  it  is  straightforward  to 
show  the  soundness  of  several  inference  axioms. 

The  following  two  axioms  (MW1)  and  (MW2)  are  a  sound  and  complete 
axiomatization  for  the  mixture  of  MVD  and  WMVD  [7] : 

(MW1)  If  MVD(Y,X,Z),  then  WMVD(Y,X,Z); 

(MW2)  If  WMVD(Y,XZ,W),  WMVD(Y,XW,Z),  and  MVD(Z,XY,W), 

then  WMVD(Y,X,ZW). 

The soundness of axiom (MW1) follows directly from the definitions of MVD and
WMVD. For axiom (MW2), by definition, WMVD(Y,XZ,W), WMVD(Y,XW,Z), and
MVD(Z,XY,W) imply:

$$\theta(XZY) \circ \theta(XZW) = \theta(XZW) \circ \theta(XZY), \qquad (9)$$

$$\theta(XWY) \circ \theta(XZW) = \theta(XZW) \circ \theta(XWY), \qquad (10)$$

and

$$\theta(XY) = \theta(XYZ) \circ \theta(XYW) = \theta(XYW) \circ \theta(XYZ), \qquad (11)$$

respectively. Using Equations (9)-(11) it follows that

$$\theta(XY) \circ \theta(XZW) = \theta(XYZ) \circ \theta(XYW) \circ \theta(XZW)$$
$$= \theta(XYZ) \circ \theta(XZW) \circ \theta(XYW)$$
$$= \theta(XZW) \circ \theta(XYZ) \circ \theta(XYW)$$
$$= \theta(XZW) \circ \theta(XY). \qquad (12)$$

Equation (12) indicates that WMVD(Y,X,ZW) holds, as desired.

As a second example, the following four inference axioms (W1)-(W4) are a sound
and complete axiomatization for WMVD [2]:

(W1) If $U \subseteq X$, then WMVD(U,X,YZ);

(W2) If WMVD(YX,XVZ,W), then WMVD(Y,XVZ,W) and WMVD(YXV,XVZ,W);

(W3) If WMVD(Y,X,ZW), then WMVD(Y,XZ,W);

(W4) If WMVD(Y,X,Z), then WMVD(Z,X,Y).

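Two properties [6] of $U \subseteq X$ are that $\theta(XU) = \theta(X)$ and

$$\theta(U) \circ \theta(X) = \theta(X) \circ \theta(U) = \theta(U),$$

since by Equation (5) composing an equivalence relation with one that refines
it yields the coarser relation. Thus, inference axiom (W1) is sound, since
$\theta(XU) \circ \theta(XYZ) = \theta(X) \circ \theta(XYZ) = \theta(X)
= \theta(XYZ) \circ \theta(X) = \theta(XYZ) \circ \theta(XU)$. Therefore,
WMVD(U,X,YZ).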
To show the soundness of (W2), we are given:

$$\theta(XVZYX) \circ \theta(XVZW) = \theta(XVZW) \circ \theta(XVZYX),$$

or equivalently,

$$\theta(XVZY) \circ \theta(XVZW) = \theta(XVZW) \circ \theta(XVZY).$$

This is the definition of WMVD(Y,XVZ,W). Now consider

$$\theta(XVZYXV) \circ \theta(XVZW) = \theta(XVZY) \circ \theta(XVZW)$$
$$= \theta(XVZW) \circ \theta(XVZY).$$

This is the definition of WMVD(YXV,XVZ,W).

In inference axiom (W3), we are initially given:

$$\theta(XY) \circ \theta(XZW) = \theta(XZW) \circ \theta(XY).$$

We want to show:

$$\theta(XYZ) \circ \theta(XZW) = \theta(XZW) \circ \theta(XYZ).$$

Consider

$$t_1 \,\theta(XYZ)\, t_2 \text{ and } t_2 \,\theta(XZW)\, t_3.$$

Since $t_1(XYZ) = t_2(XYZ)$, this implies that

$$t_1 \,\theta(XY)\, t_2 \text{ and } t_2 \,\theta(XZW)\, t_3.$$

By the given WMVD(Y,X,ZW), we obtain

$$t_1 \,\theta(XZW)\, t_4 \text{ and } t_4 \,\theta(XY)\, t_3.$$

What remains to be shown is that $t_4(XYZ) = t_3(XYZ)$, namely,
$t_4(Z) = t_3(Z)$. Now $t_4(Z) = t_1(Z) = t_2(Z) = t_3(Z)$. Therefore, we have
our desired result:

$$t_1 \,\theta(XZW)\, t_4 \text{ and } t_4 \,\theta(XYZ)\, t_3.$$

The  soundness  of  (W4)  follows  directly  from  Definition  4. 
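
As an informal check, separate from the proofs above, the commutativity condition can be tested on small relations. The sketch below assumes, following the usage in the soundness arguments, that WMVD(Y,X,Z) means θ(XY) ∘ θ(XZ) = θ(XZ) ∘ θ(XY), where θ(A) relates two tuples that agree on every attribute in A. The names theta, compose, wmvd and rows, and both sample relations, are illustrative and do not come from the paper.

```python
def theta(rel, attrs):
    """Pairs (i, j) of row indices whose rows agree on every attribute in attrs."""
    return {(i, j)
            for i, s in enumerate(rel)
            for j, t in enumerate(rel)
            if all(s[a] == t[a] for a in attrs)}


def compose(r1, r2):
    """Relational composition: (a, c) whenever a r1 b and b r2 c for some b."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


def wmvd(rel, y, x, z):
    """WMVD(Y, X, Z) read as theta(XY) o theta(XZ) = theta(XZ) o theta(XY)."""
    left, right = theta(rel, x + y), theta(rel, x + z)
    return compose(left, right) == compose(right, left)


def rows(*values):
    """Build a relation over the single-letter attributes X, Y, Z, W."""
    return [dict(zip("XYZW", v)) for v in values]


# Within X = 0, every Y value occurs with every (Z, W) combination.
r_good = rows((0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 0), (0, 1, 1, 1))
# The tuple (0, 1, 0, 0) is missing, so the two compositions differ.
r_bad = rows((0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1))

print(wmvd(r_good, "Y", "X", "ZW"), wmvd(r_good, "Y", "XZ", "W"))
print(wmvd(r_bad, "Y", "X", "ZW"))
```

On r_good both the premise and the conclusion of (W3) hold, which is consistent with the axiom; r_bad simply shows a relation in which the premise fails.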

6  Conclusion 

In this paper, we have suggested a new characterization of MVD and WMVD based
on equivalence relations. This characterization clearly exhibits the difference
between not only these two database dependencies, but also their contextual
counterparts. By expressing MVD and WMVD with equivalence relations, other
results can be readily shown as we demonstrated by proving the soundness of the
corresponding inference axioms. More importantly, the results here can be applied
to the recent interest in contextual probabilistic conditional independence
in Bayesian networks.

Abstract The conventional use of databases is commonly restricted to the retrieval
of factual data in the form of tuples or records. However most databases
also contain metadata in the form of integrity rules which can provide a rich
source of additional information not normally available to the user. Integrity
rules define what data values and relationships may exist within the database
and so their interrogation can provide answers as to whether a certain database
state is possible. Our paper describes how this may be achieved and specifies a
formal approach to implementing such an enquiry system.


1  Introduction 

A database can be seen as comprising an extension, which is the set of tuples representing
the current state of the database content, and the schema which describes the
structure of and permitted relationships within the database. An important component
of the latter is the set of integrity rules (constraints) which define the conditions which
must apply to all entries within the extension [2,3,5]. These rules are enforced whenever
changes are made to the database content. Most relational systems only permit
query execution against the extension, the query result then consisting of a subset of
the current database content. This means that query answers are limited to the current
situation since they relate to the particular state of the database at the time of the
query. Integrity rules, on the other hand, embody information which defines all legal
states of the database, that is they specify what is possible.

Being able to access integrity information would permit the formulation of modal
queries, i.e. queries about what can be or must be the case. Examples of modal queries
might be 'must all managers who earn more than £30,000 and work in London be
provided with a company car?', or 'under what conditions can Jones earn £25,000?'.
Most current systems permit only limited access to the integrity rules and do not allow
the user to formulate this type of query, even though the information is available to
provide an answer. Note that this type of question is different from the 'what if'
scenario of, say, financial planning [12] or AI reasoning [14], where the user is interested in
the implications of some hypothetical update of the database which is known to be
valid; see also earlier work [13]. We are here concerned with whether certain database
states are permissible with respect to the integrity rules and, if not, what further
conditions would make them permissible.

In the following sections we describe an approach to processing certain classes of
modal query which can be used to enhance and augment an existing conventional
query system. Section 2 outlines the basis for our approach and presents a restricted
specification for the query constructs. In section 3 we show, through a series of
examples, the algorithm for generating a modified database state which reflects the
requirements of the modal query. The process of evaluating this changed state against
the integrity rules is then described in section 4, while finally our conclusions are
presented in section 5. This work is an extension to the programme of research, by the
authors, into the generation, evaluation and application of rules in databases [6,7,8,11].


2  Representation  of  Modal  Queries 

Referring back to our second example above, we can see that were we to simulate a
modification to the database so that Jones did indeed earn £25,000 and then determine
whether the modified state violated any of the integrity rules then we would know the
answer to the query. Any violations could be reported together with a brief explanation
of their nature. If, however, there are no rule violations, associated with the hypothetical
update, then it may be seen that there are no required conditions for the specified
state to be permissible. The basis of our approach is therefore to generate a set of
modifications to the existing database state which makes the query true, and then
evaluate the integrity rules with respect to this modified state.

In relational systems, integrity rules are traditionally expressed in some form of the
predicate calculus [4,9,10]. However, in order not to restrict ourselves to any particular
implementation, we will use a standard version of the Tuple Relational Calculus
(TRC) to express both queries and integrity rules [4]. To simplify matters further we
will concern ourselves only with the modal operator of possibility P, and make the
assumption that modal expressions may contain only existential quantification,
conjunction and the standard comparison operators. The syntax of our restricted TRC
subset is therefore:

Modal queries:    P φ

Sentences φ, η ::= (φ ∧ η) | (∃v∈τ φ) | (x θ y)

Terms x, y     ::= κ | v.a

Rel θ          ::= > | < | = | ≠ | ≥ | ≤

where τ ∈ Type, κ ∈ Constant, v ∈ Variable, a ∈ Attribute
A modal query of the form P φ would be checked initially to determine whether φ was
true with respect to the current database state. If this were the case then the query
processing mechanism terminates with a response to the user that the proposition is
currently true. If, however, it is the case that φ does not hold with respect to the current
database state then we need to find some modification to the existing state which
makes φ true and then check this modified state against the integrity rules.

3 Generating Hypothetical Modifications to the Database

The process of generating database modifications which make φ true involves building
a representation of the basic requirements of the query and then instantiating this
representation according to information currently in the database relations. The process
will now be illustrated, by means of a simple example, with respect to the following
database relation:

EMP(name, status, salary, department)

which contains the tuple:

(name = Jones, status = full-time, salary = 35K, department = Sales)

Let us assume that we wish to pose the query 'Can Jones work in Accounts?'. This
query translates to the TRC expression:

φ1: P ∃x∈EMP [x.name = Jones ∧ x.department = Accounts]

To begin with, we encode the requirements of the query using a procedure taking
four arguments: the TRC expression φ, a variable binding environment E, an incoming
list of tuples IN and an outgoing list of tuples OUT. The initial state is a full TRC
expression, an empty environment and an empty list of incoming tuples, whilst the
outgoing tuples are unknown.

Thus, for the above example:

TRC = ∃x∈EMP [x.name = Jones ∧ x.department = Accounts]
E   = NULL
IN  = NULL
OUT = ?

For each occurrence of an existential quantifier, in the query expression, we generate
a unique tuple identifier together with an extended environment in which the
quantified variable is associated with this identifier. For each associated relation, we
also define a tuple template in which each attribute value is associated with a unique
uninstantiated variable. The template(s), together with the tuple identifier(s), is then
added to the incoming list, IN. We may now process the body of the quantified
expression with respect to the new environment and the extended incoming list.

Thus:

TRC = x.name = Jones ∧ x.department = Accounts
E   = [x: t1]
IN  = [t1: (name = u1, status = u2, salary = u3, department = u4) ∈ EMP]
OUT = ?

Conjunction is handled by processing each branch of the conjunct separately. Each
conjunct will inherit the same environment, but the left conjunct is processed with
respect to the IN list of the whole conjunction whilst the IN list of the right conjunct is
set to the OUT list of the left conjunct.

We will consider the left conjunct first.

TRC = x.name = Jones
E   = [x: t1]
IN  = [t1: (name = u1, status = u2, salary = u3, department = u4) ∈ EMP]
OUT = ?

This is the base case for the recursive process and we now unify [1] the objects
denoted by each side of the equality. To do this we retrieve the value associated with
x.name by noting that the variable x is associated with the tuple t1, and t1 identifies
the tuple whose name attribute has the value u1. Unifying Jones with u1 therefore
instantiates tuple t1's name attribute to the value Jones. In cases where there is a variable
on each side of the equality, unification will ensure that the appropriate attributes
take the same value, whilst if there is a constant on each side then the unification
process will only succeed if their values are the same.

Having  unified  the  appropriate  objects  we  now  simply  copy  the  IN  list  to  the  OUT 
list  and  begin  to  unwind  the  recursion.  We  will  therefore  exit  with  the  values: 

TRC = x.name = Jones
E   = [x: t1]
IN  = [t1: (name = Jones, status = u2, salary = u3, department = u4) ∈ EMP]
OUT = [t1: (name = Jones, status = u2, salary = u3, department = u4) ∈ EMP]

We  may  now  proceed  to  process  the  right  branch  of  the  conjunction,  which  inherits  its 
IN  list  from  the  OUT  list  of  the  left  conjunct. 

TRC = x.department = Accounts
E   = [x: t1]
IN  = [t1: (name = Jones, status = u2, salary = u3, department = u4) ∈ EMP]
OUT = ?

This is also a base case and, using the procedure described, we exit with the values:

TRC = x.department = Accounts
E   = [x: t1]
IN  = [t1: (name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP]
OUT = [t1: (name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP]

Each branch of the conjunction has now been processed and we exit by setting the
OUT list of the whole conjunction to be the OUT list of the right conjunct, i.e.

TRC = x.name = Jones ∧ x.department = Accounts
E   = [x: t1]
IN  = [t1: (name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP]
OUT = [t1: (name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP]

The final stage of our query processing algorithm is to set the OUT list of the
existentially quantified expression to be that of the body of the expression (the
conjunction):

TRC = ∃x∈EMP [x.name = Jones ∧ x.department = Accounts]
E   = NULL
IN  = NULL
OUT = [t1: (name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP]

and we see that, in order for the TRC expression associated with our query to be true,
a tuple of the form:

(name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP

must be present in the database.
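
The recursive procedure walked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class names (Const, Attr, Eq, And, Exists, Unknown), the helper functions and the SCHEMA dictionary are all hypothetical, only the equality operator is handled, and unification is reduced to binding placeholder objects.

```python
from dataclasses import dataclass
from itertools import count

SCHEMA = {"EMP": ["name", "status", "salary", "department"]}  # assumed schema


@dataclass(frozen=True)
class Const:            # a constant such as Jones
    value: str


@dataclass(frozen=True)
class Attr:             # a term v.a, e.g. x.name
    var: str
    name: str


@dataclass(frozen=True)
class Eq:               # only '=' is handled in this sketch
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Exists:
    var: str
    relation: str
    body: object


class Unknown:
    """An uninstantiated attribute value such as u2."""

    def __init__(self, label):
        self.label, self.bound_to = label, None

    def __str__(self):
        return self.label


def resolve(value):
    """Follow bindings until a constant or an unbound placeholder is reached."""
    while isinstance(value, Unknown) and value.bound_to is not None:
        value = value.bound_to
    return value


def unify(a, b):
    a, b = resolve(a), resolve(b)
    if a is b:
        return
    if isinstance(a, Unknown):
        a.bound_to = b
    elif isinstance(b, Unknown):
        b.bound_to = a
    elif a != b:        # two different constants: contradictory query
        raise ValueError(f"cannot unify {a} with {b}")


def value_of(term, env, tuples):
    if isinstance(term, Const):
        return term.value
    template = next(t for tid, _, t in tuples if tid == env[term.var])
    return template[term.name]


def encode(expr, env, in_list, fresh):
    """Return the OUT list for expr, given environment env and list IN."""
    if isinstance(expr, Exists):
        tid = f"t{next(fresh['t'])}"
        template = {a: Unknown(f"u{next(fresh['u'])}") for a in SCHEMA[expr.relation]}
        return encode(expr.body, {**env, expr.var: tid},
                      in_list + [(tid, expr.relation, template)], fresh)
    if isinstance(expr, And):   # OUT of the left conjunct is IN of the right
        return encode(expr.right, env, encode(expr.left, env, in_list, fresh), fresh)
    unify(value_of(expr.left, env, in_list), value_of(expr.right, env, in_list))
    return list(in_list)        # base case: copy IN to OUT


def show(out_list):
    return {tid: {a: str(resolve(v)) for a, v in t.items()} for tid, _, t in out_list}


phi1 = Exists("x", "EMP", And(Eq(Attr("x", "name"), Const("Jones")),
                              Eq(Attr("x", "department"), Const("Accounts"))))
result = show(encode(phi1, {}, [], {"t": count(1), "u": count(1)}))
print(result)
```

For the example query the sketch yields a single template t1 with name Jones, department Accounts, and the status and salary placeholders still unbound, matching the tuple derived in the text.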

Before attempting to instantiate this with respect to the current database state we
must merge any generated tuples which have the same key fields. For example, the
TRC translation for 'Can Jones work in Accounts and earn 30K?' is:

P( ∃x∈EMP [x.name = Jones ∧ x.department = Accounts]
   ∧ ∃x∈EMP [x.name = Jones ∧ x.salary = 30K] )

Applying the above algorithm returns two tuples:

(name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP

(name = Jones, status = u2, salary = 30K, department = u4) ∈ EMP

Since name, however, is the key field of EMP, these tuples must be the same and we
therefore merge them to produce a single tuple:

(name = Jones, status = u2, salary = 30K, department = Accounts) ∈ EMP

The process described above can fail at two points, namely when we attempt to
unify both sides of an equality which equates different values, and when we attempt to
merge tuples whose keys are the same yet which differ in some other attribute. In both
cases the underlying problem is that the query is about a contradictory database state,
i.e. a state which would be impossible regardless of any constraints that might exist. In
such cases we proceed no further, reporting this reason for failure to the user.
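
The merge step and its failure case might be sketched as follows. This is a simplified illustration rather than the paper's algorithm: query tuples are plain dictionaries, None stands for any uninstantiated value (so placeholders shared between tuples are not tracked), and the function name merge_on_key is hypothetical.

```python
def merge_on_key(query_tuples, key="name"):
    """Merge query tuples that share a key value; None marks an unknown value."""
    merged, unkeyed = {}, []
    for tup in query_tuples:
        k = tup[key]
        if k is None:                      # key unknown: nothing to merge with
            unkeyed.append(dict(tup))
            continue
        target = merged.setdefault(k, dict.fromkeys(tup))
        for attr, value in tup.items():
            if value is None:
                continue
            if target[attr] is not None and target[attr] != value:
                # Same key but different values: a contradictory database state.
                raise ValueError(f"contradictory values for {attr!r} of {k!r}")
            target[attr] = value
    return list(merged.values()) + unkeyed


accounts = {"name": "Jones", "status": None, "salary": None, "department": "Accounts"}
pay = {"name": "Jones", "status": None, "salary": "30K", "department": None}
print(merge_on_key([accounts, pay]))
```

Merging a tuple that places Jones in Accounts with one that places Jones in Sales raises the error, which corresponds to the second failure point described above.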

Having generated the query encoded tuples, the next stage involves hypothetically
modifying the database so that it contains tuples which match their characteristics.
Changes to a database may be effected in three main ways: inserting a new tuple,
modifying an existing tuple or deleting an existing tuple. Our system makes use of
inserts and updates, there being one of these operations for each query tuple generated
by the query expression. The information contained in the tuple determines which of
the above operations is applied. Where the key field of the query tuple is fully instantiated
and matches the key field of an existing tuple, then the information in the query
tuple will be used to update the existing tuple. If there is no key match then the query
tuple will be used to insert a new tuple into the database.

Updates

We first consider modal queries which lead to updates only. Returning to our example,
it is clear that a tuple of the form:

t1: (name = Jones, status = u2, salary = u3, department = Accounts) ∈ EMP

must be present in the database for the query to be true. However, it is the case that
the tuple:

(name = Jones, status = full-time, salary = 35K, department = Sales)

already exists.

Thus we have the situation where the key of the query tuple is instantiated and
matches the key field of an existing tuple. The other attributes of the query tuple may
now be instantiated to provide a specification of the appropriate tuple. This is accomplished
by setting each uninstantiated attribute to the value of the associated attribute
in the existing tuple. Thus the query tuple, in our example, becomes:

t1: (name = Jones, status = full-time, salary = 35K, department = Accounts) ∈ EMP

The next step is to hypothetically update the existing tuple to the new values specified
and check the integrity rules, using a standard integrity enforcement mechanism,
with respect to the modified state. If none are violated then the modified state is legal,
and the system responds to the modal question in the affirmative. If there are any
violations, then these can be reported together with an explanatory message indicating
which rules were violated and thus the reasons why the state specified by the query is
not possible. The database is then returned to its original state.
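
A minimal sketch of this update-and-check step is given below. It makes several assumptions that are not in the text: the database is a dictionary keyed by name, integrity rules are Python predicates applied to every row, None marks an uninstantiated attribute, and the rule "R1" is a hypothetical example. Instead of updating and then restoring the real data, the sketch works on a copy, which has the same observable effect.

```python
import copy


def can_update(db, query_tuple, rules, key="name"):
    """Apply query_tuple hypothetically; return (ok, names of violated rules)."""
    trial = copy.deepcopy(db)              # the real database is never touched
    existing = trial[query_tuple[key]]
    for attr, value in query_tuple.items():
        if value is not None:              # unknowns keep the existing value
            existing[attr] = value
    violated = [name for name, rule in rules.items()
                if not all(rule(row) for row in trial.values())]
    return (not violated, violated)


db = {"Jones": {"name": "Jones", "status": "full-time",
                "salary": 35, "department": "Sales"}}
query_tuple = {"name": "Jones", "status": None, "salary": None,
               "department": "Accounts"}
# Hypothetical rule: nobody in Accounts earns more than 30K.
rules = {"R1": lambda row: row["department"] != "Accounts" or row["salary"] <= 30}

print(can_update(db, query_tuple, rules))
print(can_update(db, query_tuple, {}))
```

Returning the names of the violated rules gives the caller what it needs to build the explanatory message mentioned above.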

Ambiguous  Queries 

Unlike conventional queries, modal queries may involve ambiguity which is only
made apparent when the query tuples are instantiated with respect to the current database.
Consider, for example, the query 'Can Jones earn Smith's salary?' which translates
to:

φ2: P ∃x∈EMP ∃y∈EMP [x.name = Jones ∧ y.name = Smith ∧ x.salary = y.salary]

This generates the two query tuples:

t1: (name = Jones, status = u2, salary = u3, department = u4) ∈ EMP
t2: (name = Smith, status = u6, salary = u3, department = u8) ∈ EMP

where the salary attributes of both tuples, though uninstantiated, have the same value.

Let the EMP relation contain the following tuples:

(name = Jones, status = full-time, salary = 35K, department = Sales)
(name = Smith, status = part-time, salary = 40K, department = Engineering)

Note that both of the query tuples will result in updates, since their keys match existing
tuples, but the resulting database state will be determined by which of the tuples
we instantiate first. Choosing 'Jones' gives:

t1: (name = Jones, status = full-time, salary = 35K, department = Sales) ∈ EMP
t2: (name = Smith, status = u6, salary = 35K, department = u8) ∈ EMP

and we see that instantiating Jones' tuple also sets the salary attribute of Smith's tuple.
Instantiating this second tuple with respect to the database results in:

t1: (name = Jones, status = full-time, salary = 35K, department = Sales) ∈ EMP
t2: (name = Smith, status = part-time, salary = 35K, department = Engineering) ∈ EMP

The overall effect then, is of leaving Jones' tuple unchanged while setting Smith's
salary to that of Jones. Instantiating Smith's tuple first, however, gives:

t1: (name = Jones, status = u2, salary = 40K, department = u4) ∈ EMP
t2: (name = Smith, status = part-time, salary = 40K, department = Engineering) ∈ EMP

then instantiating Jones' tuple produces the following query tuples:

t1: (name = Jones, status = full-time, salary = 40K, department = Sales) ∈ EMP
t2: (name = Smith, status = part-time, salary = 40K, department = Engineering) ∈ EMP

In  this  latter  case,  the  resulting  database  state  would  reflect  that  Smith’s  tuple  was 
unchanged,  and  Jones’  salary  is  set  to  Smith’s.  Whilst,  intuitively,  this  modification 
seems  more  consistent  with  what  was  intended,  both  alternatives  would  appear  to  be 
acceptable.  The  system  must  therefore  recognise  that  ambiguity  can  arise,  in  this  type 
of  situation,  and  be  able  to  present  the  user  with  some  means  of  choosing  between  the 
alternatives  available. 
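
One simple way to surface the alternatives is to compute the instantiated tuples once per choice of "first" tuple and hand all of them to the user. The sketch below is illustrative only: it assumes a dictionary-based EMP relation, and the function name instantiate and its parameters are hypothetical.

```python
def instantiate(db, names, shared_attr, first):
    """Fill the query tuples for `names` from db. `shared_attr` is constrained
    to be equal across them and takes its value from the tuple chosen first."""
    shared_value = db[first][shared_attr]
    return {n: {**db[n], shared_attr: shared_value} for n in names}


emp = {
    "Jones": {"name": "Jones", "status": "full-time", "salary": "35K",
              "department": "Sales"},
    "Smith": {"name": "Smith", "status": "part-time", "salary": "40K",
              "department": "Engineering"},
}

# One candidate database state per instantiation order.
alternatives = {first: instantiate(emp, ["Jones", "Smith"], "salary", first)
                for first in ("Jones", "Smith")}
for first, state in alternatives.items():
    print(first, {n: row["salary"] for n, row in state.items()})
```

Each entry of alternatives can then be checked against the integrity rules separately, so the user sees which reading of the question, if any, is permissible.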


Insertions 

We now consider modal queries which generate tuple inserts to the database. Let us
assume that we wish to check whether certain employment conditions are valid before
making an offer of employment to a candidate named Walker. A possible query might
be 'Can Walker have part-time working status and earn 35K and work in Engineering?'
This translates to the following TRC query:

φ3: P ∃x∈EMP [x.name = Walker ∧ x.status = part-time ∧ x.salary = 35K
              ∧ x.department = Engineering]

resulting in the single query tuple:

t1: (name = Walker, status = part-time, salary = 35K, department = Engineering) ∈ EMP

There is no existing tuple with a key field value of Walker, but all the attributes are
instantiated. An attempt may therefore be made to insert this hypothetical tuple into
the database. Should the transaction lead to an integrity violation then this would imply
that the above conditions of employment were not valid with respect to the database.

Evaluation of ‘must’ type queries may be seen conceptually as the construction of two
possibility sub-queries. Thus the query ‘must all full-time salespersons earn at least
20K?’ would translate to:

(1) ‘Can a person be full-time and work in Sales?’
and if so (2) ‘Can a person be full-time and work in Sales and earn less than 20K?’

A positive response to (1) and a negative response to (2) would be required to
establish the truth of the original query. It is necessary to establish the truth of the first
sub-query since a negative response to the second alone may be attributable, for example,
to violation of an integrity rule which requires that salespersons have to be
either ‘occasional’ or ‘part-time’.
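
The two-step decomposition can be illustrated with a small brute-force sketch. Everything here is an assumption made for the example: the possibility check `can` enumerates a tiny hand-picked domain rather than using the paper's algorithm, and the two integrity rules are hypothetical.

```python
from itertools import product

STATUSES = ["part-time", "full-time", "occasional"]
DEPARTMENTS = ["Sales", "Engineering"]
SALARIES = [10, 20, 30]  # in K; a coarse stand-in for a numeric domain


def can(conditions, rules):
    """True if some candidate employee meets every condition and every rule."""
    for status, department, salary in product(STATUSES, DEPARTMENTS, SALARIES):
        row = {"status": status, "department": department, "salary": salary}
        if all(c(row) for c in conditions) and all(r(row) for r in rules):
            return True
    return False


def must(antecedent, consequent, rules):
    """None if the antecedent itself is impossible, else the answer to 'must'."""
    if not can(antecedent, rules):                       # sub-query (1)
        return None
    negated = [lambda row: not consequent(row)]
    return not can(antecedent + negated, rules)          # sub-query (2)


full_time_sales = [lambda r: r["status"] == "full-time",
                   lambda r: r["department"] == "Sales"]


def earns_20k(row):
    return row["salary"] >= 20


def pay_rule(row):
    """Hypothetical rule: full-time salespersons earn at least 20K."""
    in_scope = row["status"] == "full-time" and row["department"] == "Sales"
    return not in_scope or row["salary"] >= 20


def status_rule(row):
    """Hypothetical rule: salespersons are 'occasional' or 'part-time'."""
    return row["department"] != "Sales" or row["status"] in ("occasional", "part-time")


print(must(full_time_sales, earns_20k, [pay_rule]))
print(must(full_time_sales, earns_20k, []))
print(must(full_time_sales, earns_20k, [status_rule]))
```

The third call returns None because sub-query (1) already fails, which is the situation the text warns about: a negative answer to (2) alone would not establish the original query.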

In practice it is sufficient to pose the first sub-query and then let the system determine
what assumptions need to be made regarding any unknown values in order for
the modal record to be a legal transaction with respect to the integrity rules. Thus if
one such assumption were that the employee earns 20K, or more, then this would
imply a positive response to the original question. This process is discussed further in
the next section.

4  Integrity  Rule  Evaluation 

In cases where the database is modified using query tuples in which all attribute values
are known, the program for checking the new database state against the integrity rules
is relatively straightforward. Problems arise, however, when the query tuples contain
unknown values, i.e. ones which have not been instantiated. In such cases it is necessary
for the algorithm to make assumptions about these values based on the conditions
imposed by the integrity rules. If a set of assumptions can be established which meets
the criteria for satisfying all the integrity rules then the database state is deemed consistent
and the query may be answered in the affirmative. If, however, no such set of
assumptions can be constructed then the hypothetical update fails and the query conditions
cannot be met.

For example, assume that our database relation is extended to include the attribute
location and we wish to determine under what conditions a new employee can work in
Engineering and also be located in London. This translates to the modal query:

φ5: P ∃x∈EMP [x.department = Engineering ∧ x.location = London]

resulting, since no further values may be instantiated, in the query tuple:

t1: (name = u1, status = u2, salary = u3, department = Engineering, location = London) ∈ EMP

Assume further that we have the following integrity rules:

Ψ1: ∀x∈EMP [x.status = {part-time, full-time, occasional}]

Ψ2: ∀x∈EMP [x.status = occasional → x.salary < 10K]

Ψ3: ∀x∈EMP [x.department = Engineering → x.salary > 20K]

Ψ4: ∀x∈EMP [x.department = Engineering → x.location = {Hull, London, Swindon}]

The evaluation algorithm processes through the rules as follows:

(i) Ψ1 is true if (u2 = {part-time, full-time, occasional})

(ii) Ψ2 is true if (u2 = occasional) ∧ (u3 < 10K)

(iii) Ψ3 is true if (u3 > 20K)

(iv) Ψ4 is true.

Clearly (ii) and (iii) are inconsistent and the algorithm now re-examines the necessary
conditions which make these rules true. Simple analysis reveals that Ψ2 is also satisfied
by:

(a) ¬(u2 = occasional) ∧ (u3 < 10K)
or (b) ¬(u2 = occasional) ∧ ¬(u3 < 10K)

Whilst (a) still yields an inconsistency, it may be seen that (b) makes the query true
with respect to the rules and so the necessary conditions for answering in the affirmative
are that:

((u2 = {part-time, full-time, occasional}) ∧ ¬(u2 = occasional)) ∧ (u3 > 20K)

The system response is therefore that the database state implied by the original question
is possible provided that the employee's status is 'part-time' or 'full-time' and
that his/her salary is greater than 20K.
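
The outcome of this example can be reproduced with a brute-force search over candidate values. This is a stand-in for the rule analysis described above, not the algorithm itself: the four rules are transcribed as Python predicates, None marks an unknown attribute, and the three representative salaries (below 10K, between 10K and 20K, above 20K) are an assumption made so that the numeric domain can be enumerated.

```python
from itertools import product

RULES = {
    "psi1": lambda r: r["status"] in {"part-time", "full-time", "occasional"},
    "psi2": lambda r: r["status"] != "occasional" or r["salary"] < 10,
    "psi3": lambda r: r["department"] != "Engineering" or r["salary"] > 20,
    "psi4": lambda r: (r["department"] != "Engineering"
                       or r["location"] in {"Hull", "London", "Swindon"}),
}


def admissible(query_tuple, candidates):
    """Yield each completion of the unknown attributes that satisfies all rules."""
    unknown = [a for a, v in query_tuple.items() if v is None and a in candidates]
    for values in product(*(candidates[a] for a in unknown)):
        assumption = dict(zip(unknown, values))
        if all(rule({**query_tuple, **assumption}) for rule in RULES.values()):
            yield assumption


query_tuple = {"name": None, "status": None, "salary": None,
               "department": "Engineering", "location": "London"}
candidates = {"status": ["part-time", "full-time", "occasional"],
              "salary": [5, 15, 25]}   # one representative per salary band, in K

solutions = list(admissible(query_tuple, candidates))
print(solutions)
```

Only the part-time and full-time completions with a salary above 20K survive, in line with the system response stated above. Collecting all solutions rather than stopping at the first corresponds to the option of reporting every set of conditions to the user.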

In practice the algorithm can be made to terminate when a set of necessary conditions
is found or, alternatively, it may continue to find all possible sets of necessary
conditions and report to the user that any one of these sets will result in an affirmative
answer. This latter response enables the user to explore the complete range of possibilities
for making the desired state true. Assumptions common to all possible sets
imply a necessary condition which must be met.

Whilst,  in  the  interests  of  clarity,  we  have  limited  ourselves  to  relatively  simple 
examples  in  presenting  the  rule  evaluation  process,  the  working  algorithm  is  equally 
suited  to  large  and  complex  integrity  rule  sets. 


5  Conclusions 

In  this  paper,  we  have  explained  how  it  is  possible  to  provide  an  extended  database 
answering  system  which  permits  modal  queries  against  the  database  integrity  rules.  An 
algorithm  has  been  described  which  takes  the  query  and  from  it  constructs  a  set  of 
database  modification  tuples  which,  when  applied,  bring  about  a  changed  state  of  the 
database  which  is  consistent  with  the  original  question. 

This changed state may then be checked against the integrity rules to determine
whether the required conditions have been met and whether the question may be answered
in the affirmative. We have also addressed the issue of ambiguity inherent in
many questions concerning possibility and shown that our system can respond in a
manner which permits the user to choose the interpretation closest to her intentions.

We  have  also  examined  the  situation  where  the  answers  to  modal  questions  are 
qualified  by  imposing  one  or  more  conditions  and  the  way  in  which  different  condition 
scenarios  may  be  presented  to  the  user.  This  further  enhances  the  user’s  understanding 
of  the  relationships  inherent  in  the  database  which  determine  the  system’s  response, 
thereby  providing  further  information  which  is  helpful  in  guiding  the  query  process. 

Work is currently being carried out on implementing a more comprehensive system
to incorporate a greater range of query constructs and operators.

Abstract. Observing the world and finding trends and relations among the
variables  of  interest  is  an  important  and  common  learning  activity.  In  this  paper 
we  apply  TETRAD,  a  program  that  uses  Bayesian  networks  to  discover  causal 
rules,  and  C4.5,  which  creates  decision  trees,  to  the  problem  of  discovering 
relations  among  a  set  of  variables  in  the  controlled  environment  of  an  Artificial 
Life  simulator.  All  data  in  this  environment  are  generated  by  a  single  entity 
over  time.  The  rules  in  the  domain  are  known,  so  we  are  able  to  assess  the 
effectiveness  of  each  method.  The  agent’s  sensings  of  its  environment  and  its 
own  actions  are  saved  in  data  records  over  time.  We  first  compare  TETRAD 
and  C4.5  in  discovering  the  relations  between  variables  in  a  single  record.  We 
next  attempt  to  find  temporal  relations  among  the  variables  of  consecutive 
records.  Since  both  these  programs  disregard  the  passage  of  time  among  the 
records,  we  introduce  the  flattening  operation  as  a  way  to  span  time  and  bring 
the  variables  of  interest  together  in  a  new  single  record.  We  observe  that 
flattening  allows  C4.5  to  discover  relations  among  variables  over  time,  while  it 
does not improve TETRAD's output.


1  Introduction 

In  this  paper  we  consider  the  problem  of  discovering  relations  among  a  set  of 
variables  that  represent  the  states  of  a  single  system  as  time  progresses.  The  data  are  a 
sequence of temporally ordered records without a distinguished time variable. Our
aim is to identify as many cases as possible where two or more variables' values depend
on  each  other.  Knowing  this  would  allow  us  to  explain  how  the  system  may  be 
working.  We  may  also  like  to  control  some  of  the  variables  by  changing  other 
variables.  We  use  data  from  a  simple  Artificial  Life  [6]  domain  because  it  allows  us  to 
verify  the  results,  and  thus  compare  the  effectiveness  of  the  algorithms. 

Finding associations among the observed variables is considered a useful
knowledge discovery activity. For example, if we observe that (y = 2) is always true
when (x = 5), then we could predict the value of y as 2 when we see that x is 5.
Alternatively, we could assume that we have the rule: if {(x = 5)} then (y = 2), and
use it to set the value of y to 2 by setting the value of x to 5. Some researchers [1, 10]
have tried to find the stronger notion of causality among the observed variables. In the
previous example, they may call x a cause of y.

In this paper we consider two approaches to the problem of finding relations
among variables. TETRAD [9] is a well-known causality miner that uses Bayesian
networks [3] to find causal relations. One example of the type of rules discovered by
TETRAD is x → y, which means that x causes y. From the examples in [1, 10], it
appears that Bayesian networks discover more causal relations than actually exist in
the domain. Bayesian networks find causality even in domains where the existence of
causal relations itself is a matter of debate. For words in political texts, Bayesian
networks find rules such as "Minister" is caused by "Prime" [10]. This
suggests that there is a considerable amount of disagreement about the concept of
causality. There are ongoing debates about the suitability of using Bayesian networks
for mining causality [2, 4, 5, 11]. Here we apply TETRAD to identify relationships
between variables without claiming that all of them are causal relationships.

C4.5  [8]  creates  decision  trees  that  can  be  used  to  predict  the  value  of  one  variable 
from  the  values  of  a  number  of  other  variables.  A  decision  tree  can  easily  be 
converted to a number of rules of the form if {(x = α) AND (y = β)} then (z = γ). The
variables x and y may be causing the value of z, or they may be associated together
because  of  some  other  reason.  C4.5  makes  no  claim  about  the  nature  of  the 
relationship. 
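
As a rough illustration of the tree-to-rules conversion mentioned above, the following Python sketch walks a small hand-made decision tree and emits one rule per root-to-leaf path. The nested-dictionary tree format, the function names tree_to_rules and format_rule, and the toy tree are all assumptions made for this example. It is not the c4.5rules program, which also drops unneeded conditions from the rules it derives.

```python
def tree_to_rules(node, path=()):
    """Return one (conditions, decision) pair per root-to-leaf path."""
    if "decision" in node:                       # leaf node
        return [(list(path), node["decision"])]
    rules = []
    for value, child in node["branches"].items():
        # Extend the path with the test taken on this branch.
        rules += tree_to_rules(child, path + ((node["attribute"], value),))
    return rules


def format_rule(conditions, decision):
    """Render a rule in the paper's 'if {...} then (...)' notation."""
    body = " AND ".join(f"({attr} = {value})" for attr, value in conditions)
    attr, value = decision
    return f"if {{{body}}} then ({attr} = {value})"


# Hypothetical tree: food (f = 1) is found only at x = 2, y = 3.
toy_tree = {
    "attribute": "x",
    "branches": {
        2: {"attribute": "y",
            "branches": {3: {"decision": ("f", 1)},
                         4: {"decision": ("f", 0)}}},
        5: {"decision": ("f", 0)},
    },
}

rules = tree_to_rules(toy_tree)
for conditions, decision in rules:
    print(format_rule(conditions, decision))
```

Each printed rule predicts the decision attribute from the tests along one path, without saying anything about whether the relationship is causal.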

Both  these  programs  ignore  any  temporal  order  among  the  records,  while  in  the 
data that we use there do exist relations among the variables in consecutive records,
in  the  sense  that  the  values  of  some  variables  in  a  record  affect  the  values  of  variables 
in  later  records.  We  will  describe  the  method  used  to  overcome  this  problem. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  describes  the  simple 
environment  that  we  chose  for  testing  the  different  methods.  In  Section  3  we  first 
compare  the  results  obtained  from  TETRAD  and  C4.5  when  there  is  an  association 
among  the  variables  of  a  record,  but  no  causality.  After  that  we  attempt  to  discover 
temporal  relations  among  the  records.  Section  4  concludes  the  paper. 


2  An  Agent’s  View  of  Its  Environment 

We  use  an  Artificial  Life  simulator  called  URAL  [12]  to  generate  data  for  the 
experiments.  URAL  is  a  discrete  event  simulator  with  well  known  rules  that  govern 
the  artificial  environment.  There  is  little  ambiguity  about  what  causes  what.  This 
helps  to  judge  the  quality  of  the  discovered  rules. 

The  world  in  URAL  is  made  of  a  two  dimensional  board  with  one  or  more  agents 
(called  creatures  in  Artificial  Life  literature)  living  in  it.  An  agent  moves  around  and 
if  it  finds  food,  eats  it.  Food  is  produced  by  the  simulator  and  placed  at  positions  that 
are  randomly  determined  at  the  start  of  each  run  of  the  simulator.  There  is  a  maximum 
for  the  number  of  positions  that  may  have  food  at  any  one  time,  so  a  position  that  was 
determined  as  capable  of  having  food  may  or  may  not  have  food  at  a  given  time.  The 
agent  can  sense  its  position  and  also  the  presence  of  food  at  its  current  position.  At 
each  time-step,  it  randomly  chooses  to  move  from  its  current  position  to  Up,  Down, 
Left,  or  Right.  It  cannot  get  out  of  the  board,  or  go  through  the  obstacles  that  are 
placed  in  the  board  by  the  simulator.  In  such  cases,  a  move  action  will  not  change  the 
agent’s  position.  The  agent  can  sense  which  action  it  takes  in  each  situation.  The  aim 
is  to  learn  the  effects  of  its  actions  at  each  particular  place. 

URAL employs Situation Calculus [7] to build graphs with observed situations as
the  nodes  and  the  actions  as  transition  arcs  between  the  situations.  Agents  use  the 
graphs  to  store  their  observations  of  the  world  and  to  make  plans  for  finding  food. 
URAL  differentiates  between  volatile  and  non-volatile  properties  of  a  situation.  The  x 
and  y  positions  are  non-volatile,  and  are  used  to  distinguish  among  the  situations.  The 
presence  of  food  is  a  volatile  property,  which  means  that  the  same  situation  can  be  in 
different  states.  The  creature  only  keeps  the  last  observed  state  of  a  situation.  URAL 
was  modified  for  this  experiment  to  log  each  encountered  situation  in  a  file. 

For  our  agent,  time  passes  in  discrete  steps.  At  each  time  step,  it  takes  a  snapshot  of 
its  sensors  and  randomly  decides  which  action  it  should  perform.  This  results  in 
records  such  as  <x  position,  y  position,  is  food  here?,  action>.  C4.5  treats  the  last 
variable  in  a  record  as  the  decision  attribute,  so  if  necessary  the  variables  are 
rearranged  in  the  log  file.  Figure  1  shows  two  example  sequences  of  records.  In 
Figure  1(a)  the  last  variable  is  the  action,  while  in  1(b)  it  is  the  x  position.  Here  time 
passes  vertically,  from  top  to  bottom. 


<x, y, f, a>            <y, f, a, x>

<1, 3, false, L>        <3, false, L, 1>
<0, 3, false, L>        <3, false, L, 0>
<0, 3, true, D>         <3, true, D, 0>
<0, 4, false, U>        <4, false, U, 0>
<0, 3, true, D>         <3, true, D, 0>

      (a)                     (b)

Fig.  1.  Two  example  sequences  of  records 


The  agent,  moving  randomly  around,  can  visit  the  same  position  more  than  once. 
Unlike  the  real-world  data  studied  by  many  people  [1,  2,  10],  here  we  can  reliably 
assume  a  temporal  order  among  the  saved  records,  as  each  observation  follows  the 
previous  one  in  time.  Considering  the  similarities  between  data  gathered  by  an  agent 
and  the  statistical  observations  done  by  people  for  real-world  problems,  it  is 
interesting  to  see  if  we  can  use  the  same  data  mining  techniques  to  extract  knowledge 
about  this  environment. 

The  agent  can  move  around  in  this  very  simple  world,  so  it  has  a  way  of  changing 
its  position  by  performing  a  move  action.  Creating  food,  on  the  other  hand,  is 
completely  beyond  its  power.  Finding  the  effects  of  the  agent’s  actions  requires 
looking  at  more  than  one  record  at  a  time  (two  consecutive  records  in  this 
environment), because in this environment the effects of an action always appear at a
later  time.  Any  algorithm  that  does  not  consider  the  passage  of  time  is  limiting  itself 
in  finding  causal  rules.  So  this  domain  contains  causal  relations  detectable  by  the 
agents  over  time  (the  effects  of  moving),  as  well  as  relations  that  are  not  detectable  by 
the  agent  (the  place  of  food). 

3  Experimental  Results 

The  effectiveness  of  TETRAD  and  C4.5  at  finding  valid  relations  is  assessed  in  two 
situations:  within  a  single  record  and  within  consecutive  pairs  of  records. 

3.1 Experiment 1: Relationships within a Single Record

From  the  semantics  of  the  domain,  we  know  that  no  causal  relationship  exists  within  a 
single  record.  Causal  relations  appear  across  the  records.  However,  there  is  an 
association  between  the  x  and  y  position  of  an  agent  and  the  presence  of  food  at  that 
position,  as  the  simulator  places  food  at  only  certain  places.  A  position  may  or  may 
not  contain  food  at  any  given  time.  We  created  a  log  file  of  the  first  1000  situations 
encountered  by  a  single  agent  and  used  it  for  the  experiments. 

We first fed the log file to TETRAD version 3.1. In TETRAD's notation, A •→ B
means that either A causes B, or they both have a hidden common cause; A •-• B
means that A causes B or B causes A, or they both have a hidden common cause; and
A ↔ B means that both A and B have a hidden common cause.

TETRAD  would  not  accept  more  than  8  different  values  for  each  variable,  so  the 
world  was  limited  to  an  8  x  8  square  with  no  obstacles.  In  the  log  file,  x  and  y  denote 
the agent's position, f denotes the presence of food, and a is the performed action. The
presence of food and the actions were represented by numerical values to make them
compatible with what TETRAD expects as input. The results were generated by the
"Build" command. It did assume the existence of latent common causes, and used the
exact algorithm. TETRAD's output is shown in Table 1.


Table 1. The rules discovered by TETRAD.

Case         Significance Level(s)  Discovered Rules                                         C            PC           W
1            0.0001                 y •→ x, f, a •→ x                                        1            1            1
2            0.001, 0.005, 0.01     a •→ y, x •-• a, x •→ y, f •→ y                          0            2            2
3            0.05, 0.1              x •-• y, f •→ x, a •→ x, f [illegible] a •→ y            [illegible]  [illegible]  [illegible]
[illegible]  0.2                    x •-• y, x •-• f, x •-• a, y •-• f, y •-• a, f •-• a     [illegible]  [illegible]  3

Based on the rules enforced by URAL in the artificial environment, the desired
output in TETRAD's notation would be the following relations: x •-• f (x and f are
associated), y •-• f (y and f are associated), f (f has no cause and does not cause
others). The relations that are not totally wrong are shown in bold. In the table the C
column indicates the number of Correct rules, the PC column shows the number of
Partially Correct (non-conclusive) rules, and W shows the number of Wrong rules.

The  appearance  of  food  at  a  certain  position  depends  on  a  random  variable  inside 
the  URAL  code,  and  there  is  no  causal  relation  that  the  agent  can  discover,  but  there  is 
an  association  between  positions  and  the  presence  of  food. 

In case 1 the rule a •→ x correctly guesses that a may be a cause of x. However,
this is not conclusive in the sense that it considers it possible for a hidden common
cause to exist. This is an important distinction. In the next rule, f, TETRAD identifies f
as something that does not cause anything, and is not caused by anything else. The
other rule, y •→ x, is wrong because the y coordinate of a position does not determine
the x coordinate. Case 2 does not go wrong in the first two rules, even though neither of
them is conclusive. The other two rules are wrong. Case 3 does better in finding the
relationships among a and x, and a and y, but the results are still not conclusive. Case
5 finds associations among x, y and f, but then wrongly does the same for a and f too.

As seen, TETRAD draws many wrong conclusions from the data, and with the
exception of one rule (f), the rest are not conclusive. Notice that here we have been
generously  interpreting  the  rules  involving  a  as  if  TETRAD  is  aware  that  an  action 
will  have  an  effect  on  the  next  value,  and  not  the  current  value,  of  x  or  y. 

We  then  tried  the  c4.5rules  program  of  the  C4.5  package  in  the  default 
configuration,  on  the  same  data.  We  assigned  the  presence  of  food  as  the  decision 
attribute.  C4.5rules  eliminates  unneeded  condition  attributes  when  creating  rules.  For 
example, if the value of x2 is sufficient to predict the outcome regardless of the value of
y2, the generated rule will not include y2. We are looking for rules of the form if {(x =
α) AND (y = β)} then (f = γ), which means that there is a relation between the
position  (x  and  y)  and  the  presence  of  food. 

Table 2 shows the c4.5rules program's results for determining the value of f. Rules
that are actually useful for finding food are shown in bold. The rules predicting
the  presence  of  food  correctly  included  both  x  and  y.  The  rest  of  the  rules  deal  with 
cases  where  no  food  was  present. 


Table 2. Attributes used in rules generated by C4.5 to determine the presence of food.

Decision    Condition      Number
Attribute   Attribute(s)   of Rules   Example                                  Correctness
f           x              1          [illegible]                              Correct
f           y              2          If {(y = 2)} then (f = 0)                Correct
f           a              1          If {(a = L)} then (f = 0)                Wrong
f           x, y           2          If {(x = 2) AND (y = 3)} then (f = 1)    Correct

C4.5  was  unable  to  find  useful  rules  for  determining  the  values  of  x  or  y,  as  they 
depend  on  the  previous  position  and  action,  which  are  not  available.  The  decision  tree 
for  x,  for  example,  wrongly  included  the  presence  of  food  as  a  condition  attribute. 

3.2  Experiment  2:  Relationships  among  Consecutive  Records 

As  mentioned,  the  log  file  consists  of  temporally  ordered  records.  Neither  C4.5  nor 
TETRAD  considers  the  temporal  order  and  adding  a  simple  discrete  time  stamp  as  an 
attribute  will  not  allow  them  to  find  temporal  relationships.  With  the  semantics  of  our 
example  domain  in  mind,  the  most  one  can  hope  for  in  the  previous  tests  is  finding  a 
correct  association  between  the  agent’s  position  and  the  presence  of  food. 

The  effects  of  an  action  will  not  be  seen  until  later  in  time.  Using  a  preprocessing 
step,  a  flattened  log  file  is  created  with  two  or  more  consecutive  records  as  a  single 
record. Flattening the sequences in Figure 1(a) and 1(b) using a time window of size 2
gives the sequences shown in Figure 2(a) and 2(b) respectively, where time passes
horizontally  from  left  to  right  and  also  vertically  from  top  to  bottom.  Here  we  have 
renamed  the  variables  to  remove  name  clashes.  With  the  exception  of  the  first  and  last 
records,  every  record  appears  twice,  once  as  the  second  half  (effect),  and  then  as  the 
first  half  (cause)  of  a  combined  record. 

The  appropriate  size  of  the  time  window  depends  on  the  domain.  It  should  be  wide 
enough  to  include  any  cause  and  all  its  effects.  If  we  suspect  that  the  effects  of  an 
action  will  be  seen  in  the  next  two  records,  then  we  may  flatten  Figure  1(a)  to  get 
records  like:  <1,  3,  false,  L,  0,  3,  false,  L,  0,  3,  true,  D>.  The  algorithms  accepting  the 
flattened records as input may not know about the passage of time, but flattening
brings  the  causes  and  the  effects  together,  and  the  resulting  record  then  has  the 
information  about  any  changes  in  time. 
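
The flattening step can be sketched as follows. This is a minimal Python illustration, not the preprocessing program used in the experiments; the function name flatten and the tuple-based record format are assumptions. The input is the sequence of Figure 1(a).

```python
def flatten(records, window=2):
    """Join each run of `window` consecutive records into one wide record."""
    if window < 1:
        raise ValueError("window must be at least 1")
    flattened = []
    # Slide the window one record at a time over the temporally ordered log.
    for start in range(len(records) - window + 1):
        wide = ()
        for record in records[start:start + window]:
            wide += tuple(record)
        flattened.append(wide)
    return flattened


# The sequence of Figure 1(a): <x, y, f, a>, one record per time step.
fig_1a = [
    (1, 3, False, "L"),
    (0, 3, False, "L"),
    (0, 3, True, "D"),
    (0, 4, False, "U"),
    (0, 3, True, "D"),
]

pairs = flatten(fig_1a, window=2)     # cause record followed by effect record
triples = flatten(fig_1a, window=3)   # wider window, as in the example above

for wide_record in pairs:
    print(wide_record)
```

With a window of 2, every record except the first and the last appears twice, once as the second half and once as the first half of a combined record. With a window of 3, the first combined record is the twelve-field record quoted above.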


<x1, y1, f1, a1, x2, y2, f2, a2>
<1, 3, false, L, 0, 3, false, L>
<0, 3, false, L, 0, 3, true, D>
<0, 3, true, D, 0, 4, false, U>
<0, 4, false, U, 0, 3, true, D>
(a)

<y1, f1, a1, x1, y2, f2, a2, x2>
<3, false, L, 1, 3, false, L, 0>
<3, false, L, 0, 3, true, D, 0>
<3, true, D, 0, 4, false, U, 0>
<4, false, U, 0, 3, true, D, 0>
(b)


Fig.  2.  The  flattened  sequences  of  Figures  1(a)  and  1(b). 


For  the  next  experiment,  a  window  size  of  2  was  used  because  in  the  URAL 
domain  the  effects  of  an  action  are  perceived  by  the  agent  in  the  next  situation.  In  the 
combined record, x1, y1, f1 and a1 belong to the first record, and x2, y2, f2 and a2 belong
to the next one. TETRAD's output for the resulting data is shown in Table 3.


Table 3. The rules discovered by TETRAD from the flattened records.

Case  Significance Level(s)        Discovered Rules   C   PC   W
1     0.0001                       [illegible]        2   4    8
2     0.0005, 0.001, 0.005, 0.01   [illegible]        2   1    10
3     0.1                          [illegible]        0   1    14
4     0.2                          [illegible]        0   1    18

Here we are looking for the following relations: x1 •-• f1 (x1 and f1 are associated),
y1 •-• f1 (y1 and f1 are associated), f1, f2 (f1 and f2 have no causes and do not cause
anything), x2 •-• f2 (x2 and f2 are associated), y2 •-• f2 (y2 and f2 are associated), a1 → x2
(a1 causes x2), a1 → y2 (a1 causes y2), x1 → x2 (x1 causes x2), and y1 → y2 (y1 causes y2).
The relations that are not totally wrong are shown in bold.

Most  rules  discovered  by  TETRAD  on  this  data  are  wrong.  In  comparison  with 
Experiment  1,  the  increased  number  of  variables  in  Experiment  2  has  resulted  in  an 
increase in the number of discovered rules, most of which are either wrong or not
conclusive. 

We  applied  C4.5  to  the  same  data.  The  desired  output  rules  are  of  the  following 
forms: if {(x2 = α) AND (y2 = β)} then (f2 = γ) (association between x, y, and food), if
{(x1 = α) AND (a1 = β)} then (x2 = γ) (predicting the next value of x), and if {(y1 = α)
AND (a1 = β)} then (y2 = γ) (predicting the next value of y). The results produced by
c4.5rules  are  shown  in  Table  4. 


Table 4. C4.5's results after flattening the records.

Decision Attribute   Condition Attribute(s)   Number of Rules   Correctness
f2                   x2                       1                 Correct
f2                   y2                       2                 Correct
f2                   a2                       1                 Wrong
f2                   x2, y2                   2                 Correct
x2                   x1, a1                   32                Correct
y2                   y1, a1                   32                Correct

Rules  that  can  actually  be  used  for  finding  and  reaching  food  are  shown  in  bold. 
C4.5rules generated 32 correct rules for each of x2 and y2. In a two-dimensional space,
there are 4 possible actions and 8 distinct values for x1 and y1. In this example the
creature has explored all the world, and there are 32 (8 × 4) rules for predicting
each of x2 and y2. There were no changes in the rules for f2 (the actually useful
rules  that  predict  the  presence  of  food  still  depend  on  both  x2  and  y2)  even  though  C4.5 
now  has  more  variables  to  choose  from.  This  is  because  the  current  value  of  f2  is  not 
determined  by  any  temporal  relationship.  Overall,  C4.5  did  a  much  better  job  in 
pruning  the  irrelevant  attributes  than  TETRAD. 
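
The count of 32 can be checked with a short sketch. It assumes (these details are not stated in the text) that coordinates run from 0 to 7 and that only L and R change x, with L decreasing it as in the first two records of Figure 1(a). The names next_x and rules_for_x2 are hypothetical.

```python
from itertools import product

SIZE = 8                          # 8 x 8 board; the 0..7 coordinate range is assumed
ACTIONS = ("U", "D", "L", "R")
DX = {"L": -1, "R": +1, "U": 0, "D": 0}   # effect of each action on x (assumed)


def next_x(x1, a1):
    """Next x position; a move off the board leaves the position unchanged."""
    x2 = x1 + DX[a1]
    return x2 if 0 <= x2 < SIZE else x1


# One rule "if {(x1 = α) AND (a1 = β)} then (x2 = γ)" per (x1, a1) combination.
rules_for_x2 = {(x1, a1): next_x(x1, a1)
                for x1, a1 in product(range(SIZE), ACTIONS)}

print(len(rules_for_x2))          # 8 positions x 4 actions
```

The same enumeration over (y1, a1) gives the rules for y2, which is why a fully explored world yields the same number of rules for each coordinate.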


4  Concluding  Remarks 

People  interested  in  finding  relations  among  observed  variables  usually  gather  data 
from  different  systems  at  the  same  time.  Here  we  used  data  that  represented  the  state 
of  a  single  system  over  time.  While  there  were  relations  among  the  variables  in  each 
state,  some  interesting  temporal  relations  existed  among  the  variables  of  different 
states. 

We  applied  a  causality  miner  that  uses  Bayesian  networks  to  find  relations  in  a 
very  simple  and  well-defined  domain.  The  results  were  similar  to  the  real-world 
problems:  both  correct  and  wrong  rules  were  found.  Bayesian  causality  miners  need  a 
domain  expert  (a  more  powerful  causal  relation  discoverer)  to  prune  the  output. 
Flattening  the  records  to  give  the  algorithm  more  relevant  data  resulted  in  many  more 
irrelevant  rules  being  discovered.  We  also  tested  C4.5  on  the  same  data,  and  observed 
that it is very good at pruning non-relevant attributes of the records and finding
temporal  relations  without  actually  claiming  to  be  a  causality  discoverer.  One 
consideration  with  C4.5  is  that  the  user  has  to  identify  the  decision  attribute  of 
interest. 

We observed that flattening enabled C4.5 to discover new relations that it could not
find  otherwise.  Flattening  increases  the  number  of  variables  in  the  resulting  records. 
While  creating  a  bigger  search  space  for  rule  mining,  flattening  gives  more 
information  to  the  rule  miner,  and  makes  temporal  relations  explicit. 

Abstract. Top-down learners often suffer from the plateau problem (or myopia)
of their greedy search algorithms. One way to address this is to extend the top-down
greedy search, which grows the clauses, with relational cliches. Using cliches,
the search is no longer constrained to adding one literal at a time: combinations
of literals instantiating cliches are tried as well. The paper presents
CLUSE: Cliches Learned and USEd, a system that learns cliches that are then
used either within a domain, or across domains. CLUSE is a bottom-up learner,
in which generalization proceeds according to Contextual LGG (CLGG). CLGG
is an extension of LGG that takes into account the context in which a pair of literals
is generalized. The paper defines CLGG, illustrates how cliches are
learned, and shows that the complexity of this learning is polynomial.


1  Introduction 

Inductive learners that use a first-order language to express examples, background
knowledge and hypotheses (or concept descriptions) are called inductive relational
learners. Because they induce hypotheses in the form of logic programs, they are also
called inductive logic programming (ILP) systems. Top-down inductive relational
learners such as FOIL [13] and FOCL [10] learn Horn clauses by adding one literal at a
time using a greedy search algorithm. At each step the coverage of the rule after adding
a literal is tested on training examples. The literal that best discriminates the remaining
positive and negative examples is added to the current clause. The clause is
complete when it no longer covers negative examples. Top-down systems suffer from
myopia, which arises when the best discrimination would be obtained by adding more
than one literal at once. Solving the problem requires searching for combinations of
literals rather than just single literals. Unfortunately, trying all possible combinations
of literals can be intractable. A mechanism to search efficiently through the space of
combinations of literals is needed. A learner can be provided with such a mechanism
in the form of a special-purpose bias.

We propose CLUSE (Cliches Learned and Used) [6] to learn combinations of literals
automatically as a particular type of bias. These combinations of literals are called
relational cliches. The underlying idea is to learn cliches from examples of a concept
and to use them within and across domains (Figure 1). If cliches express subconcepts
common to a domain, and the literals used to express different concepts in the same
domain overlap, then cliches learned from one concept should provide appropriate
lookahead to learn other concepts in the same domain. On the other hand, these
cliches probably have few literals in common with concepts in other domains, hence
the need for more general cliches. To solve this, CLUSE learns two kinds of cliches:
Domain Dependent Cliches (DDCs), expressed as a conjunction of literals specific to a
domain, and Domain Independent Cliches (DICs), where literals have variable predicate
symbols (hence they are not specific to a domain). When DICs are transferred
across domains they are instantiated with literals in the domain of the target concept.

CLUSE is a bottom-up inductive relational learner based on Relative Least General
Generalization (RLGG) [12]. To remedy the inefficiency and the overgeneralization
problems of RLGG, we have also developed a modified version of RLGG that exploits the
context in which LGG is applied. The modified RLGG is called Contextual Least
General Generalization (CLGG).


Figure 1. CLUSE learns relational cliches (DDCs and DICs) from examples of a
concept in one domain. DDCs are useful to learn concepts within the same domain,
whereas DICs are useful across domains.

This paper describes the learning of relational cliches with CLUSE. It introduces
CLGG, the similarity measures, and the notion of chains used in CLUSE. It describes
CLUSE's algorithm and its complexity. How these cliches address the myopia problem
of an inductive relational learner is described in [7].


2 CLGG: Contextual Least General Generalization

Much of the existing work in learning in the first-order logic setting is based on Plotkin's
Least General Generalization (LGG) [11], and on its extension, called Relative
LGG (RLGG) [12]. From the machine learning perspective, however, there are certain
practical shortcomings of the LGG approach to generalization. First, in the worst case,
the cost of applying LGG on two clauses is equal to the length of the first clause times
the length of the second one. So the cost of applying LGG on a set of clauses is exponential
in the number of literals to generalize. Second, additional knowledge (e.g.
taxonomic hierarchies) is often available during generalization. Many learning methods
take such knowledge into account in the generalization process [1, 10]. LGG does
not use any background knowledge (BK) in the generalization process. RLGG is supposed
to address knowledge-driven generalization, but since RLGG compiles all the
knowledge during generalization in the form of additional literals, this compounds the
efficiency problems of LGG.
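
To make the pairwise cost concrete, here is a minimal Python sketch of the LGG of two functor-free clauses, applied to the two scene clauses that appear in Figure 2. It is a textbook-style simplification under stated assumptions: clauses are (head, body) pairs, literals are (predicate, arguments) tuples, all body literals are positive, and no reduction of the result is attempted. The function name lgg_clauses and the V1, V2, ... variable naming are hypothetical, and this is not the CLGG procedure.

```python
def lgg_clauses(clause1, clause2):
    """LGG of two functor-free clauses given as (head, body) pairs."""
    variable_for = {}             # pair of differing terms -> shared variable

    def lgg_term(s, t):
        if s == t:
            return s
        if (s, t) not in variable_for:
            variable_for[(s, t)] = f"V{len(variable_for) + 1}"
        return variable_for[(s, t)]

    def lgg_literal(lit1, lit2):
        (pred1, args1), (pred2, args2) = lit1, lit2
        if pred1 != pred2 or len(args1) != len(args2):
            return None           # incompatible literals have no LGG
        return (pred1, tuple(lgg_term(s, t) for s, t in zip(args1, args2)))

    (head1, body1), (head2, body2) = clause1, clause2
    head = lgg_literal(head1, head2)
    if head is None:
        return None
    body, pairs_tried = [], 0
    for lit1 in body1:            # every literal of the first clause ...
        for lit2 in body2:        # ... is paired with every literal of the second
            pairs_tried += 1
            generalized = lgg_literal(lit1, lit2)
            if generalized is not None and generalized not in body:
                body.append(generalized)
    return head, body, pairs_tried


C1 = (("scene", ("a", "b")),
      [("above", ("a", "b")), ("tri", ("a",)), ("rect", ("b",)),
       ("small", ("b",)), ("black", ("a",)), ("black", ("b",))])
C2 = (("scene", ("c", "d")),
      [("above", ("c", "d")), ("sq", ("c",)), ("sq", ("d",)),
       ("black", ("c",)), ("black", ("d",))])

head, body, pairs_tried = lgg_clauses(C1, C2)
print(head, body, pairs_tried)
```

On these clauses the sketch tries 6 × 5 = 30 literal pairs and keeps four generalizations of black, one for each way of pairing a constant of the first clause with a constant of the second. Cross-pairings of this kind are what a context-sensitive generalization would aim to avoid.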

The learning system GOLEM [9] is based on RLGG but reduces the cost by constraining
the BK to a finite Herbrand model. Even using a finite model, the length of
the LGG is exponential in the number of given examples. By using restrictions like
ij-determinism and syntactically generative¹ background clauses, the length of the
LGG of a set of examples no longer depends on the number of examples. ITOU [15],
CLINT [3], and Kodratoff's system [4] encounter the same problem of efficiency as
RLGG, generating the Herbrand models of the BK (by an exhaustive saturation process).
Other learning systems like FOCL [16] and CIGOL [8] use BK, but these are not
based on RLGG.

This section (see [6] for a full presentation) outlines an alternative to LGG that exploits
the context in which LGG is applied. The context is meant here to include both
the additional knowledge available during generalization, as well as the similarity of
literals in the context of the clauses being generalized. Although CLGG is defined for
literals with nested arguments involving functors, in this presentation we limit ourselves
to simpler functor-free arguments.

To extend LGG so that context is taken into account during generalization, the
similarity between every pair of constants (or bindings) occurring in the same relation
in the two clauses being generalized is computed before the generalization. The similarity
of constants takes into account their occurrences in clauses (see below). Constant
bindings with a similarity higher than a threshold are bound to a variable. The
constants and the variable constitute the similarity bindings, which are then passed to
the CLGG to limit its search. When generalizing two clauses, literals that match² (even
with multiple occurrences) must have at least one similarity binding to be generalized.
Moreover, to take into account the context in which the generalization of two clauses
takes place and to address the shortcomings of RLGG, the BK is used in a lazy manner.
Only unmatched literals restricted with similarity bindings that find a generalization
in the BK are generalized.


2.1  Similarity  Measure  Evaluates  Bindings  of  Constants 

We borrowed the similarity measure from Bisson [2] to evaluate the bindings of constants
in clauses. For each constant a_i, a list of occurrences (denoted by occ(a_i)) is
made of pairs (predicate-a_i, position-of-a_i), one for each literal where the constant occurs.
The predicate-a_i is the literal's predicate and the position-of-a_i is the term's position
among the arguments of predicate-a_i.

1  A clause is said to be syntactically generative if the variables in its head are a subset of the
variables in its body.

2  Two literals match if they have the same predicate and arity.

Two constants match if they occur in two literals whose predicates are identical.
The similarity between two constants from two clauses is the ratio of the length of the
lists of common occurrences to the maximum length of constants' occurrences in the
two clauses. It results in a value in [0, 1], where the closer the value gets to 1,
the more similar the constants are. The similarity measure formula is:

$$\mathrm{sim}(a_i, a_j) = \frac{\mathrm{length}(\mathrm{occ}(a_i) \cap \mathrm{occ}(a_j))}{\mathrm{MAX}(\mathrm{length}(\mathrm{occ}(a_i)),\ \mathrm{length}(\mathrm{occ}(a_j)))}$$

The overall idea of this definition is that constants occurring in two clauses are
similar if they occur in a similar enough context. It is required that occurrences of
constants in relations (i.e. literals with more than one argument) match, otherwise their
similarity binding is zero³.
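
The ratio above can be sketched in Python as follows. Several details are assumptions rather than part of the definition: a clause is given as a flat list of (predicate, arguments) literals with the head included, occurrence lists are treated as multisets, positions are counted from 1, and the extra requirement on relations in the last sentence is left out. The function names occurrences and sim are hypothetical. The two clauses are the scene clauses of Figure 2.

```python
from collections import Counter


def occurrences(constant, clause):
    """Multiset of (predicate, argument position) pairs where `constant` occurs."""
    occ = Counter()
    for predicate, args in clause:
        for position, term in enumerate(args, start=1):
            if term == constant:
                occ[(predicate, position)] += 1
    return occ


def sim(a_i, clause_i, a_j, clause_j):
    """Common occurrences divided by the longer of the two occurrence lists."""
    occ_i = occurrences(a_i, clause_i)
    occ_j = occurrences(a_j, clause_j)
    longest = max(sum(occ_i.values()), sum(occ_j.values()))
    if longest == 0:
        return 0.0                # neither constant occurs at all
    common = sum((occ_i & occ_j).values())   # multiset intersection
    return common / longest


C1 = [("scene", ("a", "b")), ("above", ("a", "b")), ("tri", ("a",)),
      ("rect", ("b",)), ("small", ("b",)), ("black", ("a",)), ("black", ("b",))]
C2 = [("scene", ("c", "d")), ("above", ("c", "d")), ("sq", ("c",)),
      ("sq", ("d",)), ("black", ("c",)), ("black", ("d",))]

for first, second in [("a", "c"), ("b", "d"), ("a", "d"), ("b", "c")]:
    print(first, second, sim(first, C1, second, C2))
```

Under these assumptions the pairs (a, c) and (b, d) score well above the cross pairs (a, d) and (b, c), so a suitable threshold would retain only the first two as similarity bindings.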


2.2  CLGG  Relative  to  a  BK 

As in RLGG, the CLGG (in the functor-free case) of two clauses C1 and C2 (denoted
CLGG(C1, C2)) is the least upper bound (or least general generalization) of C1 and
C2 in the θ-subsumption lattice. Unlike RLGG, CLGG exploits an intensional BK in a
lazy way. It first generalizes clauses and then uses the BK to generalize unmatched
literals that have at least one similarity binding.

Figure 2 illustrates the extent to which similarity bindings limit CLGG search when
there exist multiple examples of the same predicate in clauses (black/1). Unlike
RLGG, CLGG limits the generalization of the predicate black to combinations with
one similarity binding (e.g. V1 and V2). It results in a generalization (above(V1,
V2), black(V1), black(V2)) and unmatched literals tri(a), rect(b), and
small(b) for C1, and sq(c), sq(d) for C2. When a literal from the BK subsumes
unmatched literals with a similarity binding, then the subsuming literal is added to the
generalization and unmatched literals are discarded. For instance, Poly(V1) is added
to the generalization because it subsumes tri(a) and sq(c).

C1: scene(a, b) :- above(a, b), tri(a), rect(b), small(b),
                   black(a), black(b).

C2: scene(c, d) :- above(c, d), sq(c), sq(d), black(c),
                   black(d).

CLGG(C1, C2) with BK:

=> scene(V1, V2) :- above(V1, V2), black(V1), black(V2),
                    Poly(V1), Rect(V2).

Generalized literals: Poly(V1) = (tri(a), sq(c))
                      Rect(V2) = (rect(b), sq(d))

Unmatched literals: small(b)

Bindings: V1 = (a, c), V2 = (b, d)

Figure  2.  CLGG  relative  to  the  BK.  Predicates  with  a  capital  letter  are  learned  using  a 
taxonomy  of  geometric  forms. 


3  CLGG  is  used  to  learn  relational  cliches  where  relations  are  more  important  than  attributes 
[6].

3  CLUSE  Learns  Relational  Cliches

The generalization problem of learning cliches consists of finding common parts in
examples. In relational domains, examples are represented with different relations and
features, which makes it difficult to find a generalization process that will succeed in
finding common parts of a set of examples. Moreover, in most of these domains,
important concepts are represented by a small number of connections among constants
defining examples. For these reasons, CLUSE splits examples into their shortest
chains. This is similar to the idea of relational path finding [14]. Intuitively, a chain is
a pattern showing how objects are related to one another and how their features are
used in examples. So, each relation in an example (which provides the structural
information between objects) and all features of the related objects form a chain. Every
relation and feature of an example is preserved and some features may occur in more
than one chain. A chain is thus defined as a connected conjunction of literals where
one and only one literal is a relation.

E:  on(x, y), cir(x), rect(y), leftof(y, z), iso(z)

C1: on(x, y), cir(x), rect(y)

C2: leftof(y, z), rect(y), iso(z)

Figure  3.  Example  E  expressed  in  terms  of  chains  (Cl  and  C2). 

Figure 3 shows the shortest chains C1 and C2 in the example E. They are connected
combinations of literals with a single relation (on(x, y) and leftof(y, z)).
Because relevant relations are not always connected to the head of examples, chains are
used without the example's head. For instance, both relations on(x, y) and
leftof(y, z) would be related to the head scene(x, y). On the other hand,
leftof(y, z) would not be connected if the head is scene(x).
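The chain definition can be illustrated with a short Python sketch. It assumes, for illustration only, that a literal is a (predicate, arguments) pair, that a relation is a literal with more than one argument, and that a feature is a literal with exactly one argument; the function name `split_into_chains` is hypothetical.

```python
def split_into_chains(literals):
    """Return one chain per relation: the relation plus all features of its arguments."""
    relations = [lit for lit in literals if len(lit[1]) > 1]
    features = [lit for lit in literals if len(lit[1]) == 1]
    chains = []
    for relation in relations:
        # A feature joins the chain when its single argument occurs in the relation.
        related = [f for f in features if f[1][0] in relation[1]]
        chains.append([relation] + related)
    return chains


# Example E of Figure 3 (the head of the example is left out, as in the text).
example_e = [
    ("on", ("x", "y")),
    ("cir", ("x",)),
    ("rect", ("y",)),
    ("leftof", ("y", "z")),
    ("iso", ("z",)),
]
chains = split_into_chains(example_e)
```

The feature rect(y) ends up in both chains, which matches the remark that some features may occur in more than one chain.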


3.1  CLUSE’s  Algorithm 

The general algorithm of learning DDCs and DICs with CLUSE is as follows. Positive
and negative examples (and, optionally, the BK) are given to CLUSE. Examples are
split into chains. At the beginning, positive chains are considered roots of the
structure. CLUSE evaluates the similarity of each pair of roots. It chooses the two most
similar roots and generalizes them using CLGG. The resulting generalization becomes
the parent (and the new root) of the two generalized chains. The similarity of this root
with each other root is evaluated. CLUSE repeats this process until no more
generalizations are possible. When taxonomies are available to CLUSE, the most similar
chains with the lowest cost are generalized first. The cost expresses the distance
between predicates in the taxonomy. CLUSE uses the similarity of clauses to choose the
two most similar chains to generalize first. The similarity measure of two clauses is
computed from similarity bindings of constants. The formula is:

$$\mathrm{sim}(C_1, C_2) = \prod_{i=1}^{n} \prod_{j=1}^{m} \mathrm{sim}(a_{1i}, a_{2j})$$

for $\mathrm{sim}(a_{1i}, a_{2j})$ …,
where $a_{1i} \in C_1$ and $a_{2j} \in C_2$, $n$ is the number of constants in $a_1$, and $m$ is the
number of constants in $a_2$.
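The bottom-up loop described in this section can be sketched as follows. This is a simplified illustration, not CLUSE itself: the clause similarity and the CLGG step are passed in as the hypothetical parameters `similarity` and `generalize`, taxonomy costs are ignored, and the loop simply stops when the best pair cannot be generalized.

```python
from itertools import combinations


def build_hierarchy(chains, similarity, generalize):
    """Greedily merge the two most similar roots until no merge is possible.

    Returns the remaining roots and the list of (parent, left, right) merges.
    """
    roots = list(chains)
    merges = []
    while len(roots) > 1:
        scored = [
            (similarity(a, b), i, j)
            for (i, a), (j, b) in combinations(enumerate(roots), 2)
        ]
        scored = [entry for entry in scored if entry[0] > 0]
        if not scored:
            break  # no pair of roots is similar at all
        _, i, j = max(scored, key=lambda entry: entry[0])
        parent = generalize(roots[i], roots[j])
        if parent is None:
            break  # simplification: stop when the best pair has no generalization
        merges.append((parent, roots[i], roots[j]))
        # The parent replaces its two children as a root.
        roots = [r for k, r in enumerate(roots) if k not in (i, j)] + [parent]
    return roots, merges


# Toy stand-ins (assumptions): chains are sets of predicate names, similarity is
# the Jaccard index, and generalization keeps the shared predicates.
def jaccard(a, b):
    return len(a & b) / len(a | b)


def shared(a, b):
    return (a & b) or None


toy_chains = [frozenset({"on", "cir", "rect"}),
              frozenset({"on", "cir", "sq"}),
              frozenset({"leftof", "iso"})]
toy_roots, toy_merges = build_hierarchy(toy_chains, jaccard, shared)
```

In the toy run the first two chains are merged into their shared part, and the third chain stays a root because it is not similar to anything else.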

A generalization is added to the structure as the parent of the two chains that it
generalizes. To avoid a tree with duplicate generalizations (when many chains result in
the same generalization), chains that are exactly subsumed by their generalization (i.e.
they differ only by their argument names) are removed. Every unmatched literal
(when combined with the generalization) that is exactly subsumed by the
generalization it accompanies is also removed.

Pruning the structure preserves generalizations (DDCs) with good coverage of
examples and discards others. CLUSE traverses a structure depth-first and computes
coverage frequencies for positive and negative examples. Coverage frequencies
correspond to the number of positive (or negative) examples subsumed by the
generalization divided by the total number of positive (or negative) examples⁴. A generalization
is preserved when it covers fewer negative examples than the generalization that
subsumes it, and satisfies the user-defined coverage thresholds. This way, the pruning
eliminates generalizations with low recall and low precision.
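The preservation test can be written down as a small Python sketch. The function names, the threshold parameters `min_pos` and `max_neg`, and the treatment of a root (no subsuming generalization, so only the thresholds apply) are assumptions made for illustration.

```python
def coverage_frequency(covered, total):
    """Fraction of the examples (positive or negative) subsumed by a generalization."""
    return covered / total if total else 0.0


def is_preserved(f_pos, f_neg, parent_f_neg, min_pos, max_neg):
    """Keep a generalization that meets both thresholds and covers fewer
    negatives than the generalization subsuming it (None for a root)."""
    meets_thresholds = f_pos >= min_pos and f_neg <= max_neg
    improves_on_parent = parent_f_neg is None or f_neg < parent_f_neg
    return meets_thresholds and improves_on_parent


# Hypothetical thresholds: at least 5% of the positives, at most 25% of the negatives.
root_kept = is_preserved(0.5, 0.25, None, min_pos=0.05, max_neg=0.25)
child_kept = is_preserved(0.4, 0.0, 0.25, min_pos=0.05, max_neg=0.25)
no_gain_kept = is_preserved(0.4, 0.25, 0.25, min_pos=0.05, max_neg=0.25)
```

A child that covers exactly as many negatives as its parent is dropped, since it brings no gain in precision.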

DDCs are useful for learning concepts in the same domain where they are learned.
On the other hand, cliches independent of the domain are useful for learning concepts
in other domains. To learn such cliches, first-order predicates of DDCs are replaced
with second-order predicates, giving DICs. The information as to whether a predicate
of a DDC is generalized using the BK or not is preserved within the predicate name in
the DIC. A predicate that subsumes other predicates in the taxonomy is called an
intensional predicate, whereas a predicate that belongs to examples is called an
extensional predicate. Intensional predicates are generalized to a predicate variable of the
form IntP# and extensional predicates to a predicate variable of the form ExtP#. So,
when DICs are used to learn a concept in a new domain, this information can be
recovered and used to instantiate second-order predicates with intensional or extensional
predicates of the new domain.
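The replacement of first-order predicates by predicate variables can be sketched as below. Two assumptions are made for illustration: intensional predicates are recognized by a capitalized name (the convention stated in the caption of Figure 2), and a predicate that occurs more than once reuses the same predicate variable. The function name `ddc_to_dic` is hypothetical.

```python
def ddc_to_dic(ddc):
    """Map each predicate of a DDC to IntP# or ExtP#, numbered in order of appearance.

    ddc: list of (predicate, arguments) pairs.
    """
    variable_of = {}
    counters = {"IntP": 0, "ExtP": 0}
    dic = []
    for predicate, args in ddc:
        if predicate not in variable_of:
            kind = "IntP" if predicate[0].isupper() else "ExtP"
            counters[kind] += 1
            variable_of[predicate] = f"{kind}{counters[kind]}"
        dic.append((variable_of[predicate], args))
    return dic


# The DDC leftof(X,Y), Rect(X), iso(Y), red(Y) learned in the blocks domain.
ddc = [("leftof", ("X", "Y")), ("Rect", ("X",)), ("iso", ("Y",)), ("red", ("Y",))]
dic = ddc_to_dic(ddc)
```

For this DDC the sketch yields ExtP1(X,Y), IntP1(X), ExtP2(Y), ExtP3(Y).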

The overall complexity of learning cliches is polynomial, due to the similarity
evaluation of all chains, $O(k^2 n)$, where k is the maximum number of arguments for any
relation, and n the number of literals (see [6] for a full presentation).


3.2  Examples  of  Learning  Relational  Cliches 

This section illustrates an example of CLUSE learning the concept Scene in the blocks
domain⁵. It shows the chains used for learning, the structure of generalizations, the


4  The frequency of coverage gives more flexibility than a measure like information gain [13].
Unlike information gain, the coverage frequencies of a generalization explicitly represent the
proportion of subsumed positive and negative examples. This allows the user to fix two
different coverage thresholds for choosing generalizations. For instance, the user may choose to
preserve only generalizations that cover at least 50% of the positives and at most 25% of the
negatives.

5  CLUSE  has  also  been  used  to  learn  cliches  in  the  real-life  domain  of  the  Finite  Element  Mesh 
Design  (see  http://www.site.uottawa.ca/--imorin/Programs/CLUSE/Output/Mesh/'). 
DDCs and the DICs learned. Scene is a disjunctive concept describing 1) an ellipse
above a rectangle, which is left of an isosceles triangle; 2) an ellipse above an
isosceles triangle and a rectangle left of the triangle. In both cases the ellipse may be small,
the rectangle may be large and the triangle may be red. Moreover, the ellipse can also
be a circle, the rectangle a square, and the isosceles triangle a right-angled isosceles
triangle.

Figure 4 illustrates a part of the structure of generalizations built with CLUSE for
the concept Scene. Each level includes: the generalization, the coverage frequencies of
chains, unmatched literals (generalized or not) and the positive chains subsumed⁶. For
instance, at the lowest level of the structure, generalization G46 subsumes chains 1, 7
and 17 with one unmatched literal large(x1) that belongs to the chain {1}. G49
subsumes G46 and chain 13. CLUSE knows from the BK (a taxonomy of geometric
forms) that an equilateral triangle and an isosceles right-angled triangle are also
isosceles triangles. So, equil(V12) from G46 and iso_rangl(x26) from chain 13 are
generalized into iso(V18) in G49.


The structure of generalizations                                   F+     F-

G58: leftof(V1,V2), Rect(V1), Iso(V2).                              0.5    0.25
     Rect(V1) -> (sq(x37), rect(V21)).
     Iso(V2) -> (equil(x38), iso(V22)).
     large(x29) {15}                                                0.05   0
     Subs. chains: {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}

G51: leftof(V21,V22), Rect(V21), iso(V22), red(V22).                0.4    0
     Rect(V21) -> (sq(V17), rect(V9)).
     Subs. chains: {1, 3, 5, 7, 9, 11, 13, 17}

G49: leftof(V17,V18), sq(V17), Iso(V18), red(V18).
     Iso(V18) -> (equil(V12), iso_rangl(x26)).
     Subs. chains: {1, 7, 13, 17}

G46: leftof(V11,V12), sq(V11), equil(V12), red(V12).
     large(x1) {1}
     Subs. chains: {1, 7, 17}

G45: leftof(V9,V10), large(V9), rect(V9), Iso(V10), red(V10).
     Iso(V10) -> (iso(V1), iso_rangl(x6)).
     Subs. chains: {3, 5, 9}

Figure 4. CLUSE generalizes chains into a structure and prunes this structure
according to generalizations' coverage frequencies. Generalizations are identified by G#
(and appear in bold), followed by predicate bindings, unmatched literals, and
subsumed chains. Generalizations G45, G46, and G49 are pruned (shaded).


After the generalization, CLUSE prunes the structure in a top-down manner
according to the coverage frequencies of generalizations. For instance, CLUSE preserves
G58, which covers 50% of the positives and 25% of the negatives (F+ = 0.5 and F-
= 0.25). The combination of the unmatched literal large(x29) with G58 covers 5%
of the positives and none of the negatives. It covers fewer negatives than G58 itself, so


6  Similarly  for  negative  chains  subsumed. 
the unmatched literal is preserved. CLUSE continues with G51 and finds the coverage
frequencies to be 40% for the positives and 0% for the negatives. G51 covers fewer
negative examples than G58, so CLUSE preserves it. Since the coverage of negatives
is already at the minimum and generalizations under G51 are more specific than G51
(and similarly if G51 had unmatched literals), CLUSE knows that no generalizations
under G51 can cover fewer negatives than G51. Therefore, CLUSE prunes all
descendants of G51 (i.e., G49, G46, and G45).

Generalizations left after pruning are returned with their frequencies as learned
DDCs (Table 1)⁷. A generalization with each of its unmatched literals makes a DDC.
For instance, DDCs 1 and 2 are created from generalization G58: DDC 1 corresponds
to G58 itself, DDC 2 corresponds to G58 with its unmatched literal large(X). Table
1 also shows DICs generalized from each DDC.


Table  1.  Learned  DDCs  with  their  coverage  frequencies  and  their  corresponding 
DIC. 


G#   DDC                                    F+     F-     DIC

58   leftof(X,Y),Rect(X),Iso(Y)             0.5    0.25   ExtP1(X,Y),IntP1(X),IntP2(Y)

58   leftof(X,Y),Rect(X),large(X),Iso(Y)    0.05   0      ExtP1(X,Y),IntP1(X),ExtP2(X),IntP2(Y)

51   leftof(X,Y),Rect(X),iso(Y),red(Y)      0.4    0      ExtP1(X,Y),IntP1(X),ExtP2(Y),ExtP3(Y)

4  Conclusion 

The  paper  presented  an  extension  of  LGG/RLGG  that  exploits  additional  knowledge 
available  during  generalization  and  the  similarity  of  literals  in  the  context  of  the 
clauses  being  generalized.  CLGG  is  less  expensive  to  apply  than  LGG/RLGG.  No 
literals  are  added  to  clauses  prior  to  the  generalization,  and  matching  is  restricted  to 
literals  with  similarity  bindings. 

This paper showed the underlying algorithm to learn cliches with CLUSE and the
algorithm's complexity. CLUSE uses CLGG and the notion of chains to learn
relational cliches in a bottom-up manner into a hierarchy of generalizations. CLUSE
prunes this hierarchy according to the generalizations' coverage frequencies of chains.
Preserved generalizations and their coverages are returned as learned DDCs. DDCs
are further generalized into DICs with variable predicates. DDCs are considered
domain-dependent, since they are expressed with predicates specific to a domain,
whereas DICs are domain-independent.

CLUSE could be used to create a library of concept hierarchies from different
domains of application. Concept hierarchies (as described in Langley [5]) provide a
better memory organization than flat lists of cliches and allow some pruning, giving a
solution to the utility problem [5]. Classifying new instances with a concept hierarchy
involves moving downward through the hierarchy. At each level, instantiate the cliche
or use coverage frequencies on the alternative nodes to select one to expand, then
recurse to the next level.


7  For  simplicity,  variable  names  are  changed. 
Abstract.  This  paper  introduces  the  design  of  rough  neurons  based  on  rough
sets.  Rough  neurons  instantiate  approximate  reasoning  in  assessing  knowledge 
gleaned  from  input  data.  Each  neuron  constructs  upper  and  lower 
approximations  as  an  aid  to  classifying  inputs.  The  particular  form  of  rough 
neuron  considered  in  this  paper  relies  on  what  is  known  as  a  rough  membership 
function  in  assessing  the  accuracy  of  a  classification  of  input  signals.  The 
architecture  of  a  rough  neuron  includes  one  or  more  input  ports  which  filter 
inputs  relative  to  selected  bands  of  values  and  one  or  more  output  ports  which 
produce  measurements  of  the  degree  of  overlap  between  an  approximation  set 
and  a  reference  set  of  values  in  classifying  neural  stimuli.  A  class  of  Petri  nets 
called  rough  Petri  nets  with  guarded  transitions  is  used  to  model  a  rough  neuron. 
An  application  of  rough  neural  computing  is  briefly  considered  in  classifying 
the  waveforms  of  power  system  faults.  The  contribution  of  this  article  is  the 
presentation  of  a  Petri  net  model  which  can  be  used  to  simulate  and  analyze 
rough  neural  computations. 


1.  Introduction 

This paper considers the design of a rough neuron, which is based on rough set theory
[1]-[3]. The study of rough neurons is part of a growing number of papers on neural
networks based on rough sets. Rough-fuzzy multilayer perceptrons in knowledge
encoding and classification were introduced in [4]. Rough-fuzzy neural networks
have also recently been used in classifying the waveforms of power system faults [5]-[6].
Purely rough membership function neural networks were introduced in [7] in the
context of rough sets and the recent introduction of rough membership functions [8].
There are two types of rough neurons: approximation neurons and rule-based decider
neurons. An approximation neuron consists of a number of input ports governed by
filters, a processing element which constructs a rough set, and one or more output
ports which utilize a rough membership function to compute the degree-of-accuracy
of the approximate knowledge represented by the rough set derived by the neuron.
The notion of an input port filter comes from signal processing. A filter is a device
which transmits signals in a selected band of frequencies and rejects (or attenuates)
signals in other bands [9]-[10]. Filters can be calibrated by adjusting the bandwidth
of values which can stimulate an approximation neuron. The contribution of this
article is the presentation of a Petri net model of a rough neuron which can be used to
simulate and analyze rough neural computations.

This paper is organized as follows. The basic concepts of rough sets,
decision rules and rough membership functions underlying the design of rough
neurons are presented in Section 2. The design of sample rough neurons is also
presented in Section 2. A Petri net model of a rough neuron is given in Section 3.


2.  Basic  Concepts 

A  brief  introduction  to  the  basic  concepts  underlying  the  design  of  rough  membership 
function  neurons  is  given  in  this  section. 


2.1  Rough  Sets 

Rough set theory offers a systematic approach to set approximation [1]-[3], [8]. To
begin, let $S = (U, A)$ be an information system where $U$ is a non-empty finite set of
objects and $A$ is a non-empty finite set of attributes where $a: U \to V_a$ for every $a \in A$.
For each $B \subseteq A$, there is associated an equivalence relation $Ind_A(B)$ such that

$$Ind_A(B) = \{(x, x') \in U^2 \mid \forall a \in B.\ a(x) = a(x')\} \qquad (1)$$

If $(x, x') \in Ind_A(B)$, we say that objects $x$ and $x'$ are indiscernible from each other
relative to attributes from $B$. The notation $[x]_B$ denotes equivalence classes of
$Ind_A(B)$. For $X \subseteq U$, the set $X$ can be approximated only from information
contained in $B$ by constructing a B-lower and B-upper approximation denoted by $\underline{B}X$
and $\overline{B}X$ respectively, where $\underline{B}X = \{x \mid [x]_B \subseteq X\}$ and
$\overline{B}X = \{x \mid [x]_B \cap X \neq \emptyset\}$. The objects of $\underline{B}X$ can be classified as members of $X$
with certainty, while the objects of $\overline{B}X$ can only be classified as possible members of
$X$. Let $BN_B(X) = \overline{B}X - \underline{B}X$. A set $X$ is rough if $BN_B(X)$ is not empty.
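The lower and upper approximations can be computed directly from these definitions. The following Python sketch assumes, for illustration, that the information system is stored as a dictionary mapping each object to its attribute values; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(table, B):
    """Group objects that agree on every attribute in B (the indiscernibility relation)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in B)].add(obj)
    return list(classes.values())


def approximations(table, B, X):
    """Return the B-lower and B-upper approximations of the set X."""
    lower, upper = set(), set()
    for eq_class in equivalence_classes(table, B):
        if eq_class <= X:   # class entirely inside X: certain members
            lower |= eq_class
        if eq_class & X:    # class overlapping X: possible members
            upper |= eq_class
    return lower, upper


# Toy information system with two attributes.
table = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}
lower, upper = approximations(table, ["a", "b"], {"x1", "x3"})
boundary = upper - lower
```

Here x1 and x2 are indiscernible, so x1 is only a possible member of the set; the boundary region is non-empty and the set is rough.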


2.2  Rough  Membership  Functions 

A rough membership function (rmf) makes it possible to measure the degree to which any
specified object with given attribute values belongs to a given set $X$ [8], [16]. A rm
function is defined relative to a set of attributes $B \subseteq A$ in information system $S =
(U, A)$ and a given set of objects $X$. The equivalence class $[x]_B$ induces a partition of
the universe. Let $B \subseteq A$, and let $X$ be a set of observations of interest. The degree
of overlap between $X$ and $[x]_B$ containing $x$ can be quantified with a rmf given in (2):

$$\mu_X^B : U \to [0, 1] \quad \text{defined by} \quad \mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|} \qquad (2)$$


2.3  Example  Rough  Membership  Function 

A sample rough membership function computation is given in this section (see Fig. 1(a)).

$$\mu_{\overline{B}F}^{B}(u) = \frac{|\overline{B}F \cap [u]_B|}{|[u]_B|} = \frac{1}{4}$$

Fig. 1(a). Sample rmf value

Fig. 1(b). Overlapping regions


Let $B$ be a set of attributes of waveforms of power system faults (e.g., b1 = phase
current, b2 = maximum phase current, and so on). Let $F$ be a set of fault signal files.
Further, let $\overline{B}F$ = {f3, f4, f7, f8} be an upper approximation, and let $[u]_B$ = {f4, f9, f10,
f15} be an equivalence class containing files representing a known fault. For
overlapping regions shown in Fig. 1(b), the degree of overlap between $\overline{B}F$ and $[u]_B$
can be computed as in Fig. 1(a).
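The sample value can be checked with a few lines of Python. The function name `rough_membership` is an illustrative choice; the two sets are the ones given in the text.

```python
def rough_membership(eq_class, reference):
    """Degree of overlap: the share of the equivalence class that lies in the reference set."""
    return len(eq_class & reference) / len(eq_class)


upper_approx_F = {"f3", "f4", "f7", "f8"}   # upper approximation of F
class_u = {"f4", "f9", "f10", "f15"}        # equivalence class [u]_B
overlap = rough_membership(class_u, upper_approx_F)
```

Only f4 is shared, so one of the four files in the equivalence class lies in the upper approximation and the overlap is 1/4.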

2.4  Design  of  Rough  Neurons 

Neural networks are collections of massively parallel computation units called
neurons. A neuron is a processing element in a neural network. Two types of rough
neurons have been identified: approximation and decider neurons [7]. Let $\overline{B}X$ be an
upper approximation relative to a set of attributes $B$ and reference set $X$. An
approximation neuron $\eta$ computes $y = \mu^{B}(\overline{B}X, [u]_B)$. A decider rough neuron
implements a collection of decision rules by (i) constructing a condition vector $c_{exp}$
from its inputs, which are rm function values, (ii) discovering the rule $c_i \Rightarrow d_i$ with a
condition vector $c_i$ which most closely matches an input condition vector $c_{exp}$, and (iii)
outputting $\min(e_i, d_i)$ where $d_i \in \{0, 1\}$ and $e_i = \|c_{exp} - c_i\| / \|c_i\| \in [0, 1]$. In cases where
$d_i = 0$, $y_{rule} = \min(e_i, d_i) = 0$, and the classification is unsuccessful. If $d_i = 1$, then
$y_{rule} = \min(e_i, d_i) = e_i$ indicates the relative error in a successful classification.
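The three steps of a decider neuron can be sketched in Python. Two points are assumptions made for illustration: "most closely matches" is taken to mean the smallest Euclidean distance, and every rule is assumed to have a non-zero condition vector. The function name `decider_output` and the sample rules are hypothetical.

```python
import math


def decider_output(c_exp, rules):
    """rules: list of (condition_vector, decision) pairs with decision in {0, 1}.

    Returns min(e, d) for the rule whose condition vector is closest to c_exp,
    where e is the relative error ||c_exp - c|| / ||c||.
    """
    def distance(u, v):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))

    condition, decision = min(rules, key=lambda rule: distance(c_exp, rule[0]))
    norm = math.sqrt(sum(q * q for q in condition))
    relative_error = distance(c_exp, condition) / norm
    return min(relative_error, decision)


# Two hypothetical rules over two rm function values.
rules = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
y_success = decider_output([0.9, 0.0], rules)   # closest rule has d = 1
y_failure = decider_output([0.0, 0.8], rules)   # closest rule has d = 0
```

In the first call the output is the relative error of a successful classification; in the second the decision is 0, so the output is 0 regardless of the error.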
2.5  Sample  Rough  Neural  Network

A high voltage direct current (dc) transmission system connected between an ac source
and an ac power distribution system has two converters. In the case where the flow of
power is from the ac side to the dc side as in Fig. 2, a converter acts as a rectifier
in changing ac to dc. The inverter in Fig. 2 converts dc power to ac power at the desired
output voltage and frequency. The Dorsey Station in the Manitoba Hydro system, for
example, acts as an inverter in converting dc to ac, which is distributed throughout
North America.



Fig.  2.  dc  Link  Between  ac  Systems 


A decision (d) to classify a waveform for a power transmission fault depends on an
assessment of phase current (pc), current setting (cs), maximum phase current (max
pc), ac voltage error (acve), pole line voltage (plvw) and phase current (pcw)
waveforms. A sample commutation failure decision table is given next. In Table 1,
d = 1 {0} indicates that the waveform for a fault represents {does not represent} a
power system failure.


Table  1.  Sample  Power  System  Failure  Decision  Table 


file     acve    pc/cs   plvw   pcw      cs       max pc   d

file 1   0.059   0.069   0      0.0187   0        0        0

file 3   0.059   0.069   1      0.0187   0.1667   0.0856   1

Signal  data  needed  to  construct  the  condition  granules  in  Table  1  come  from  files 
specified  in  column  1  of  the  table.  Sample  discretized  rules  derived  from  Table  1 
using  Rosetta  [17]  are  given  in  (3)  and  (4). 

plvw([*,  0.750))  AND  cs([0.111,  *))  AND  max-pc([*,  0.043))  =>  d(no)  (3) 

plvw([0.750,  *))  AND  cs([0.111,  *))  AND  max-pc([0.043,  *))  =>  d(yes)  (4) 
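Rules (3) and (4) translate directly into interval tests. The sketch below reads an interval such as [0.750, *) as "greater than or equal to 0.750" and [*, 0.750) as "less than 0.750"; this reading of the notation, the function name `classify_fault` and the `None` result when no rule fires are assumptions made for illustration.

```python
def classify_fault(plvw, cs, max_pc):
    """Apply the two discretized rules; return 'yes', 'no', or None if neither fires."""
    if plvw < 0.750 and cs >= 0.111 and max_pc < 0.043:
        return "no"    # rule (3)
    if plvw >= 0.750 and cs >= 0.111 and max_pc >= 0.043:
        return "yes"   # rule (4)
    return None


# Attribute values of file 3 in Table 1.
decision_file3 = classify_fault(plvw=1, cs=0.1667, max_pc=0.0856)
```

With the values of file 3, rule (4) fires, which agrees with d = 1 in Table 1. Neither of these two sample rules covers file 1, whose current setting is below 0.111.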
Fig.  3(c).  Decider  neuron


The basic structure of a rough neural network is given in Fig. 3(a). The decider
neuron in Fig. 3(b) implements rules derived from Table 1. In the network in Fig.
3(a), the parameters to be tuned are represented by B, the set of relevant features.
The goal of tuning is to improve the quality of concept (Fault) approximation.

2.6  Sample  Verification 

A comparison between the output from a rough neural network used to classify power
system faults relative to 24 fault files and the known classification of the sample fault data
is given in Fig. 4. In all of the cases considered in Fig. 4, there is a close match
between the target faults and the faults identified by the neural network.


3.  Petri  Net  Model  of  a  Rough  Neuron 

In  what  follows,  it  is  assumed  that  the  reader  is  familiar  with  classical  Petri  nets  [19] 
and  coloured  Petri  nets  [20].  Rough  Petri  nets  are  derived  from  coloured  and 
hierarchical  Petri  nets  as  well  as  from  rough  set  theory  [21].  A  rough  Petri  net 
provides  a  basis  for  modeling,  simulating  and  analyzing  rough  neurons,  rough  neural 
networks,  and  granular  decision  systems. 

3.1  Rough  Petri  Nets 

A rough Petri net (rPn) is a structure $(\Sigma, P, T, A, N, C, G, E, I, W, \Re, \xi)$ where

•  $\Sigma$ is a finite set of non-empty data types called color sets.

•  $N$ is a 1-1 node function where $N: A \to (P \times T) \cup (T \times P)$.

•  $C$ is a color function where $C: P \to \Sigma$.

•  $G$ is a guard function where $G: T \to [0, 1]$.

•  $E$ is an arc expression function where $E: A \to$ Set_of_Expressions where $E(a)$ is an
expression of type $C(p(a))$ and $p(a)$ is the place component of $N(a)$.

•  $I$ is an initialization function where $I: P \to$ Set_of_Closed_Expressions where $I(p)$ is
an expression of type $C(p)$.

•  $W$ is a set of strengths-of-connections where $\xi: A \to W$.

•  $\Re = \{\rho_\sigma \mid \rho_\sigma \text{ constructs } \sigma \in \{\text{rough set structure}\}\}$

Let $U$, $S$, $A$, $d$ be a set of inputs, an information system, the attributes of $S$, and a decision,
respectively. Examples of rough set structures constructed by $\rho$ from information
granules are the decision system $S = (U, A \cup \{d\})$ and the set OPT(S) of all rules
derived from reducts of a decision system table for $S$. Borrowing from coloured
Petri nets, a rough Petri net provides data typing (colour sets) and sets of values of a
specified type for each place. The expression $E(p, t)$ specifies the input associated
with the arc from input place $p$ to transition $t$, and the expression $E(t, p')$ specifies a
transformation (activity) performed by transition $t$ on its inputs $\{E(p, t)\}$ to produce an
output for place $p'$.


Fig.  4.  Sample  Verification 


3.2  Guarded  Transitions 

In a rough Petri net, various families of guards can be defined which induce a
level-of-enabling of transitions [21]. Consideration of level-of-enabling stems from guards
named after Jan Lukasiewicz [22], who inaugurated the study of multivalued logic.
Let $U$ denote a universe of objects, and let $X \subseteq U$. Let $\lambda: U \to [0, 1]$.

Def. 1 Lukasiewicz Guard. A Lukasiewicz guard on transition t with input x is a
higher order propositional function $P(\lambda(x))$ labeling the transition t with input x and
output $\lambda(x)$. The guard $P(\lambda(x)) = \lambda(x) \in (0, 1]$, where $0 < \lambda(x) \le 1$, enables t.

With one exception, notice that $\lambda(x)$ can be used to model a filter on an input port
of an approximation neuron, since there is interest in preventing input signals with
zero strength from enabling an input transition. To complete the modeling of an
input port filter, a restricted Lukasiewicz guard is needed.

Def. 2 Restricted Lukasiewicz Guard. A restricted Lukasiewicz guard on
transition t with input x is a function $P(\lambda(x))$ labeling the transition t with input x and
output $\lambda(x)$. The guard $P(\lambda(x)) = \lambda(x) \in (0, 1]$, where $0 < \lambda(x) \le 1$, enables t.
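The enabling condition can be illustrated with a small Python predicate. It follows the condition as printed above (a strength in the half-open interval (0, 1] enables the transition) and adds an optional band [a, b], which the paper mentions for input-port filters; the function name `guard_enables` and its parameters are hypothetical.

```python
def guard_enables(strength, a=0.0, b=1.0):
    """True when the signal strength lies in (0, 1] and inside the band [a, b]."""
    if not (0.0 <= a <= b <= 1.0):
        raise ValueError("the band [a, b] must lie inside [0, 1]")
    return 0.0 < strength <= 1.0 and a <= strength <= b


zero_signal_enables = guard_enables(0.0)               # zero strength never enables
mid_signal_enables = guard_enables(0.5)                # inside (0, 1]
out_of_band_enables = guard_enables(0.2, a=0.3, b=0.9) # outside the calibrated band
```

Calibrating the filter then amounts to choosing a and b, which matches the earlier remark that filters are calibrated by adjusting the bandwidth of values that can stimulate a neuron.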


Fig.  5.  Rough  Neuron  Petri  Net  Model 

3.3  Petri  Net  Model  of  a  Rough  Neuron

Let $\eta$ be an approximation neuron with a single input port $p_i$ and single output port $p_o$.
Let $X$ be a set of inputs for $\eta$, $B$ a set of attributes, and $\lambda$ a filter on $p_i$ of $\eta$. Let $\rho$ be
a procedure which constructs $\underline{B}X$, $\overline{B}X$ and let $\mu_{\overline{B}X}(x)$ compute the output of $\eta$ (see
Fig. 5). The notation ?p indicates a receptor place which is always input ready. The
filter $\lambda(x)$ returns x in cases where $\lambda(x) > 0$, $\lambda(x) \in [a, b] \subseteq (0, 1]$. The transition
labeled "rough" in Fig. 5 is enabled by the input of signal x and set of attributes B.
When this transition fires, $\rho(x)$ constructs $\underline{B}X$, $\overline{B}X$. The availability of $\underline{B}X$, $\overline{B}X$
and equivalence class $[u]_B$ enables the transition labeled "rmf" in Fig. 5. Whenever
the rmf transition fires, $\mu_{\overline{B}X}(x)$ computes the degree of overlap between $[u]_B$ and
$\overline{B}X$. The advantage in constructing a Petri net model of a rough neuron is that it facilitates
a number of tests, such as reachability of each of the transitions in the model and the
action of the guard modeling a filter on a rough neuron input port.
4.  Concluding  Remarks

The  basic  features  in  the  design  of  a  particular  kind  of  rough  neuron  called  an 
approximation  neuron  are  presented  in  this  paper.  The  introduction  of  rough  neurons 
has  been  motivated  by  the  search  for  improved  means  of  identifying  and  classifying 
features  in  a  feature  space.  The  output  of  an  approximation  neuron  is  a  rough 
membership  function  value,  which  indicates  the  degree  of  overlap  between  an 
approximation  region  and  some  other  set  of  interest  in  a  classification  effort.  A  Petri 
net  model  of  an  approximation  neuron  has  also  been  given.  The  guarded  transitions 
in  a  rough  Petri  net  make  it  possible  to  model  a  filter  on  an  input  port  of  a  rough 
neuron.  A  sample  application  of  these  neurons  in  a  power  system  fault  classification 
system  has  been  given.  Future  work  will  entail  a  study  of  a  more  complete 
classification  and  design  of  rough  neurons. 
Abstract: In order to search through a sound database, information about the
musical contents has to be attached to the file; otherwise the user has to look for
the specific musical information by himself. Wavelet analysis is one of the possible
tools that can be used as a basis for automatic classification of musical data. In
this paper, the author presents wavelet-based parameters extracted from sounds
of musical instruments. These parameters have been used as a basis of automatic
classification of musical instrument sounds. Tests evaluating the efficiency of
such parameterization were performed by means of rough set based algorithms
and decision trees. Results of these tests are presented in this paper.

Keywords:  Knowledge  Discovery  and  Data  Mining,  Soft  Computing,  Sound 
1  Introduction

One of the main problems concerning sound databases is how to automatically
classify the musical material contained in a recording. For instance, if
information about which musical instruments are playing in the piece is not attached to a file,
it is not possible to extract such information automatically. Automatic classification
makes possible automatic labeling of the content of the multimedia data, which is the
aim of the ISO/IEC standard MPEG-7 that is under development by MPEG (Moving
Picture Experts Group).

Sounds (even single sounds) that are to be processed by a classification
algorithm cannot be represented as raw data, i.e. as a set of samples. Objects
processed by a classifier should be described by a set of attributes (the fewer
the better), so sound data need parameterization before classification.
Additionally, since the same sound may be changed dramatically by musical
interpretation and recording conditions, an appropriate parameterization is
necessary as a preprocessing step before classification.

Such a parameterization is based on sound analysis. Sounds can be analyzed both
in the temporal and in the spectral domain, using many methods, such as the
Fourier transform, wavelet transform, correlation, cepstral analysis, filtering,
statistical methods and so on [1], [2], [3], [4]. The wavelet transform is
especially useful for musical applications, because this time-frequency analysis
divides the spectrum into frequency bands that are of equal width on a
logarithmic scale, which is similar to human hearing. Therefore, wavelet
analysis can be used as a tool for the classification of musical instrument
sounds and for labeling of recordings.

Finding a sound parameterization that is appropriate for instrument
classification is a difficult task and usually requires a very careful choice of
attributes [6], [12]. The parameterization presented here is quite simple, and
the set of attributes contains 162 parameters. Many attributes are redundant,
and soft computing methods have been used to find the most useful ones among
them. The results presented in this paper were obtained using decision trees and
rough set based methods, but other classifiers (such as neural networks,
k-nearest neighbor, etc.) can also be used [5], [9], [10], [12], [13].


2  Wavelet  Analysis  and  Parameterization  of  Musical  Data 

The wavelet analysis applied in the presented work is based on the division of
the spectrum into octave bands, using filters of the second order proposed by
Daubechies and Coifman (see Fig. 1). The wavelet transform of the function $f$
is performed as a decomposition of $f$ using a mother wavelet $\psi$ and a
scaling function $\varphi$ in the following way [11]:

where:

$\langle \cdot, \cdot \rangle$ - inner product,
$j$ - resolution level,
$k$ - time instant,

$\psi_{jk}(t) = 2^{j/2}\,\psi(2^{j}t - k)$,

$\psi(t) = \sum_{k} g_k\,\varphi(2t - k)$,

$\varphi_{jk}(t) = 2^{j/2}\,\varphi(2^{j}t - k)$,

$\varphi(t) = \sum_{k} h_k\,\varphi(2t - k)$,

$g_k$, $h_k$ - coefficients of highpass and lowpass filters,

$g_k = (-1)^k h_{1-k}$.

The  above  analysis  gives  good  frequency  resolution  and  poor  time  resolution  for 
low  frequency  bands,  and  good  time  resolution  and  poor  frequency  resolution  for  high 
frequency  bands. 

A single sound of any instrument can be quite long: it may last for several
seconds, and its timbre may change with time. Since the parts of a sound that
are most important for recognition by humans are the beginning (starting
transient, i.e. the attack) and the middle part of the sound (quasi-steady
state), these parts have been taken into account during parameterization. Sounds
from CDs [7], digitally recorded in stereo with a sampling frequency of 44.1 kHz
and 16-bit resolution, have been analyzed (each channel separately) using the
wavelet transform with an analyzing frame of 4096 samples, taken from the attack
and from the quasi-steady state of the sound. The calculated parameters are
based on the part of each frame that contains the coefficient of the greatest
energy. An example result of the wavelet analysis of a sound, with the area
selected for parameterization marked by a black frame, is presented in Fig. 2.

Fig. 1. Scaling functions $\varphi$ and mother wavelets $\psi$ for filters of order 2:
(a, b) proposed by Coifman
(c, d) proposed by Daubechies

Fig. 2. Wavelet analysis (Daubechies $\varphi$ and $\psi$) of the clarinet sound a3 (1760 Hz),
for sampling frequency 44.1 kHz and analyzing frame 4096 samples;
the darker the area, the greater the magnitude

The parameters calculated from the wavelet analysis are as follows [12]:

•  W1, ..., W38 - energy of the parameterized spectrum bands for the Daubechies
wavelet of order 2, in the middle of the attack;

$W_i = E_i / E$, where:

E - overall energy of the parameterized part of the frame; $E_i$ - partial energy:
i=23,...,38 - spectral components in the frequency band 11.025-22.05kHz,
i=15,...,22 - spectral components in the frequency band 5.5125-11.025kHz,
i=11,...,14 - spectral components in the frequency band 5.5125-11.025kHz,
i=9,10 - spectral components in the frequency band 2.75626-5.5125kHz,
i=1,...,8 - spectral components for lower frequency bands;

•  W39, ..., W76 - energy for the Daubechies wavelet of order 2, in the middle of
the steady state;

•  W77 - position of the middle of the attack, $W_{77} \in (0, 1)$;

•  W78 - position of the middle of the steady state, $W_{78} \in (0, 1)$;

•  W79, ..., W115 - energy of the parameterized spectrum bands for the Coifman
wavelet of order 2, in the middle of the attack;

$W_i = E_i / E$, where:

i=100,...,115 - spectral components in the frequency band 11.025-22.05kHz,
i=92,...,99 - spectral components in the frequency band 5.5125-11.025kHz,
i=88,...,91 - spectral components in the frequency band 5.5125-11.025kHz,
i=86,87 - spectral components in the frequency band 2.75626-5.5125kHz,
i=79,...,85 - spectral components in lower frequency bands;

•  WU6, ...,  Wl5,  -  energy  for  Coifman  wavelet  of  order  2,  in  the  middle  of  the  steady 
state. 
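
The relative band energies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: it takes the 4-tap Daubechies filter as the "order 2" filter, uses periodic extension at the frame boundaries, and returns one relative energy per octave band rather than the per-component values W1, ..., W38. The function names `dwt_step` and `octave_band_energies` are hypothetical.

```python
import math

# Daubechies low-pass filter with 4 taps (two vanishing moments). Treating this
# as the "order 2" filter of the text is an assumption of this sketch.
SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# High-pass filter from the quadrature mirror relation, written for a causal
# 4-tap filter: g_k = (-1)^k * h_{3-k}.
G = [(-1) ** k * H[3 - k] for k in range(4)]


def dwt_step(signal):
    """One analysis step with periodic extension: (approximation, detail)."""
    n = len(signal)
    approx, detail = [], []
    for i in range(0, n, 2):
        approx.append(sum(H[k] * signal[(i + k) % n] for k in range(4)))
        detail.append(sum(G[k] * signal[(i + k) % n] for k in range(4)))
    return approx, detail


def octave_band_energies(frame, levels):
    """Relative energy E_i / E per octave band, highest band first.

    The last entry is the energy left in the low-frequency approximation.
    The frame length should be divisible by 2 ** levels.
    """
    energies = []
    approx = list(frame)
    for _ in range(levels):
        approx, detail = dwt_step(approx)
        energies.append(sum(c * c for c in detail))
    energies.append(sum(c * c for c in approx))
    total = sum(energies)
    return [e / total for e in energies]
```

For a frame sampled at 44.1 kHz, the first entry corresponds to the band 11.025-22.05 kHz, the second to 5.5125-11.025 kHz, and so on, each band being one octave wide.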

The set of parameters presented here is used to describe a single sound of a
musical instrument. The data represent all available sounds in the musical scale
of the following instruments:

•  bowed  string  instruments:  violin,  viola,  cello  and  double  bass; 

•  woodwinds:  flute,  oboe  and  clarinet; 

•  brass:  trumpet,  trombone,  French  horn  and  tuba. 

Sounds  of  these  instruments  were  recorded  using  various  playing  techniques, 
namely  vibrato,  pizzicato  (when  strings  are  plucked  with  fingers),  with  and  without 
muting. 

The  investigated  data  have  been  grouped  into  classes  in  the  following  ways: 

•  18  classes  -  each  class  contains  objects  representing  sounds  of  one  instrument, 
played  using  one  technique:  flute  -  vibrato,  oboe  —  vibrato,  B  flat  clarinet,  C 
trumpet,  C  trumpet  -  muted,  French  horn,  French  horn  -  muted,  tenor  trombone, 
tenor  trombone  -  muted,  tuba,  violin  -  vibrato,  violin  -  pizzicato,  viola  -  vibrato, 
viola  -  pizzicato,  cello  -  vibrato,  cello  -  pizzicato,  double  bass  -  vibrato,  double 
bass  -  pizzicato; 

•  11 classes, each containing objects representing sounds of one instrument,
played using various techniques: flute, oboe, B flat clarinet, C trumpet, French
horn, tenor trombone, tuba, violin, viola, cello, double bass;

•  5 classes, each containing objects representing sounds of one family of
instruments, played with the same technique: woodwinds, brass without muting,
brass with muting, strings vibrato, strings pizzicato;

•  3 classes, each containing objects representing sounds of one family of
instruments, played with various techniques: woodwinds, brass, strings;

•  4 classes, representing sounds of instruments played with the same technique:
vibrato (strings, flute, oboe), pizzicato (strings), muting (trumpet, French
horn, trombone), without vibrato or muting (clarinet, brass);

•  3 classes, representing sounds of instruments played with the same technique,
but with muting not separated into its own class: vibrato (strings, flute,
oboe), without vibrato (clarinet, brass), pizzicato (strings).

The data set contains 1358 objects. The proposed set of parameters is probably
redundant, but classification algorithms can be applied as a tool for filtering
these attributes.
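
The six groupings listed above can all be derived from the 18 fine-grained labels. The following Python sketch shows one way to do this; the label "plain" (for sounds played without vibrato, pizzicato or muting) and all function names are assumptions made for illustration and do not come from the paper.

```python
# The 18 fine-grained classes as (instrument, technique) pairs.
FINE_CLASSES = [
    ("flute", "vibrato"), ("oboe", "vibrato"), ("B flat clarinet", "plain"),
    ("C trumpet", "plain"), ("C trumpet", "muted"),
    ("French horn", "plain"), ("French horn", "muted"),
    ("tenor trombone", "plain"), ("tenor trombone", "muted"), ("tuba", "plain"),
    ("violin", "vibrato"), ("violin", "pizzicato"),
    ("viola", "vibrato"), ("viola", "pizzicato"),
    ("cello", "vibrato"), ("cello", "pizzicato"),
    ("double bass", "vibrato"), ("double bass", "pizzicato"),
]

FAMILY = {
    "flute": "woodwinds", "oboe": "woodwinds", "B flat clarinet": "woodwinds",
    "C trumpet": "brass", "French horn": "brass",
    "tenor trombone": "brass", "tuba": "brass",
    "violin": "strings", "viola": "strings",
    "cello": "strings", "double bass": "strings",
}


def by_family_and_technique(instrument, technique):
    """The 5-class grouping: instrument family combined with technique."""
    family = FAMILY[instrument]
    if family == "woodwinds":
        return "woodwinds"
    if family == "brass":
        return "brass with muting" if technique == "muted" else "brass without muting"
    return "strings " + technique


def by_technique(technique, separate_muting):
    """The 4-class grouping, or the 3-class one when muting is not separated."""
    if technique == "muted" and not separate_muting:
        return "plain"
    return technique


GROUPINGS = {
    "instrument and technique": lambda i, t: (i, t),
    "instrument": lambda i, t: i,
    "family and technique": by_family_and_technique,
    "family": lambda i, t: FAMILY[i],
    "technique, muting separate": lambda i, t: by_technique(t, True),
    "technique, muting merged": lambda i, t: by_technique(t, False),
}


def class_counts():
    """Number of distinct classes produced by each grouping."""
    return {name: len({fn(i, t) for i, t in FINE_CLASSES})
            for name, fn in GROUPINGS.items()}
```

Calling `class_counts()` reproduces the class counts of the six groupings (18, 11, 5, 3, 4 and 3).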


3  Classification  Algorithms 


As mentioned in the first section, automatic classification of data can be
performed in many ways. In the described research, classification algorithms
have been used not only to learn classification rules, but also to test the
proposed set of parameters. The author decided to choose decision trees and
rough set based algorithms, because they are quite fast and present the results
in a way that is visually easy for a user to interpret.

The decision trees used here are all binary, constructed from the root to the
leaves. Nodes are labeled with attributes (parameters), chosen by the maximal
gain ratio criterion [9]:

$$\frac{I(a \to d)}{H(a)}$$

where

$I(a \to d) = H(d) - H(d \mid a)$ - information gain for the attribute a and the class d,

$H(d) = -\sum_{i=1}^{k} p(d_i) \log p(d_i)$ - entropy of the class d,

$H(d \mid a) = -\sum_{j=1}^{l} p(a_j) \sum_{i=1}^{k} p(d_i \mid a_j) \log p(d_i \mid a_j)$ - conditional entropy for d and a,

$p(v)$ - probability of the value v.

Edges are labeled with values of the attribute labeling the parent node.
Real-valued data are quantized, and the optimal cut point c is found on the
basis of the entropy criterion. Attribute values x are divided into 2 sets:
x > c and x ≤ c, and these sets are used to label the edges. The leaves of the
tree represent classes with probability controlled by a user.
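
The gain ratio criterion and the search for a cut point can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the C4.5 implementation: base-2 logarithms, candidate cuts taken at midpoints between consecutive distinct values, and the function names are all assumptions of this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, classes, cut):
    """I(a -> d) = H(d) - H(d | a) for the binary split x <= cut / x > cut."""
    low = [c for v, c in zip(values, classes) if v <= cut]
    high = [c for v, c in zip(values, classes) if v > cut]
    n = len(classes)
    h_cond = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(classes) - h_cond


def gain_ratio(values, classes, cut):
    """Information gain divided by the entropy H(a) of the split itself."""
    h_split = entropy([v <= cut for v in values])
    return info_gain(values, classes, cut) / h_split if h_split > 0 else 0.0


def best_cut(values, classes):
    """Midpoint between consecutive distinct values with the largest gain."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    if not candidates:
        return None  # a constant attribute cannot be split
    return max(candidates, key=lambda c: info_gain(values, classes, c))
```

For an attribute whose values separate two classes perfectly, for example `best_cut([0.1, 0.2, 0.8, 0.9], ["p", "p", "v", "v"])`, the chosen cut lies between the two groups and the gain ratio of that split equals 1.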

Since objects representing the investigated classes are mixed and some
attributes in the data set are redundant, the created trees have been pruned,
which reduces the number of attributes they use, their depth and their number of
branches. Generally speaking, the pruning is driven by the admissible
probability of incorrect classification of new objects.

Rough set theory [8] is based on a specific concept of membership function,
describing elements x of a set X. In classical Cantor set theory, the membership
function $\mu_X(x)$ is defined as

$$\mu_X(x) = \begin{cases} 0 & \text{for } x \notin X \\ 1 & \text{for } x \in X \end{cases}$$

In rough set theory, we assume preliminary information $I(x)$ about elements
$x \in U$ of the set $X \subset U$. On the basis of the information function
$I: U \to 2^U$ such that $(\forall x \in U)\,[\,x \in I(x)\,]$, the membership
function $\mu_X^I(x) \in [0, 1]$ is defined as

$$\mu_X^I(x) = \frac{\mathrm{card}(X \cap I(x))}{\mathrm{card}\, I(x)}$$

Rough set based systems allow processing of imprecise or inconsistent data. The
system used in the described research [10] includes quantization of real-valued
attributes. The domain of each attribute is divided into intervals of equal
width; the number of intervals (10 by default) can be selected by the user.
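
A minimal Python sketch of equal-width quantization followed by the rough membership function is given below. It assumes that I(x) is the set of objects whose quantized descriptions coincide with that of x; this is a common choice but an assumption here, and the code does not reproduce the internals of DataLogic/R+. All names are hypothetical.

```python
def equal_width_bin(value, lo, hi, n_bins=10):
    """Index of the equal-width interval of [lo, hi] that contains value."""
    if hi <= lo:
        return 0
    index = int((value - lo) / (hi - lo) * n_bins)
    return min(max(index, 0), n_bins - 1)


def quantize(table, n_bins=10):
    """Replace every real-valued attribute by the index of its interval.

    table maps an object name to a tuple of attribute values.
    """
    columns = list(zip(*table.values()))
    ranges = [(min(col), max(col)) for col in columns]
    return {
        name: tuple(equal_width_bin(v, lo, hi, n_bins)
                    for v, (lo, hi) in zip(row, ranges))
        for name, row in table.items()
    }


def rough_membership(descriptions, target, x):
    """card(X ∩ I(x)) / card I(x) for the object x and the target set X."""
    info = {y for y, d in descriptions.items() if d == descriptions[x]}
    return len(info & set(target)) / len(info)


# Hypothetical example: three similar sounds and one clearly different sound.
sounds = {"s1": (0.01, 0.50), "s2": (0.02, 0.52),
          "s3": (0.03, 0.51), "s4": (0.95, 0.10)}
quantized = quantize(sounds)
target_class = {"s1", "s2", "s4"}
```

Here `s1`, `s2` and `s3` fall into the same intervals and are therefore indiscernible, so `rough_membership(quantized, target_class, "s1")` is 2/3, while `s4` belongs to the target class with membership 1.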


4  The  Experiments 

Both decision trees and rough set based systems have been used to extract rules
describing musical instrument data. An example decision tree describing the
investigated data for 18 classes (only a part is shown) is presented below:

W78 <= 0.5661 :
|   W132 <= 0.00012145 :
|   |   W88 <= 0.0053521 :
|   |   |   W51 > 0.0047161 : violin pizzicato
|   |   |   W51 <= 0.0047161 :
|   |   |   |   W10 > 0.00305402 : viola pizzicato
|   |   |   |   W10 <= 0.00305402 :
|   |   |   |   |   W43 <= 0.00090222 :
|   |   |   |   |   |   W47 <= 0.00113098 : double bass pizzicato
|   |   |   |   |   |   W47 > 0.00113098 : viola pizzicato
|   |   W88 > 0.0053521 :
|   |   |   W3 <= 0.0003781 : viola pizzicato
|   |   |   W3 > 0.0003781 : violin pizzicato
|   W132 > 0.00012145 :
|   |   W43 > 0.117745 : cello pizzicato
|   |   W43 <= 0.117745 :
|   |   |   W39 > 0.00088248 : trumpet muted
W78 > 0.5661 :
|   W78 <= 0.9079 :
|   |   W122 <= 0.388391 :
|   |   |   W45 <= 0.594423 :

The attributes in the tree describe both the attack and the steady state of the
sound, using both Daubechies and Coifman wavelets. The root of the tree is
labeled by one of the temporal attributes, since this attribute allows easy
discernment between sounds played pizzicato and vibrato. The same attribute
labels the root of the trees for each division of the data into classes.

Apart from decision trees, classification rules in a classical form have also
been extracted for these data, using both C4.5 and DataLogic/R+. The obtained
rules are of various length and accuracy. For example, some rules are based on
only one or very few attributes:

•  W152 > 0.0008401 => trumpet muted (accuracy 96.7%),

•  W78 > 0.9087 ∧ Ws4 > 0.858949 => flute (accuracy 79.4%),

•  W78 < 0.5661 ∧ W81 > 0.400169 => double bass pizzicato (accuracy 85.7%),

•  W43 > 0.117745 ∧ W78 < 0.5661 ∧ W14 > 0.0127438 => cello pizzicato
(accuracy 85.5%).

Some of the obtained rules are longer, for example:

•  W, < 0.0001733 ∧ W78 < 0.5661 ∧ Wa8 > 0.00026218 ∧ W132 > 0.00012145 =>
violin pizzicato (accuracy 96.6%),

•  W6 < 0.782861 ∧ W13 > 0.00019162 ∧ W14 > 2.133e-05 ∧ W,s < 0.00513477 ∧
W45 > 0.00027319 ∧ 0.00041218 < < 0.678519 ∧ W77 < 0.0258 ∧ 0.9079 < W78 < 0.949
∧ W85 > 5.832e-05 ∧ W93 > 2.819e-05 => clarinet (accuracy 93.4%),

•  W13 > 5.94e-06 ∧ W34 < 7.555e-05 ∧ W51 > 0.00057545 ∧ W59 < 0.00097323
∧ W77 < 0.0583 ∧ 0.5661 < W78 < 0.8519 ∧ W121 > 0.56432 ∧ W127 < 0.0369227
=> French horn (accuracy 97.9%),

•  W73 > 5.054e-05 ∧ W74 < 3.476e-05 ∧ 0.5661 < W78 < 0.7812 ∧ W103 < 0.00116903
∧ W122 > 0.380377 => oboe (accuracy 82.0%),

•  Wa < 0.00010887 ∧ W59 < 0.00124617 ∧ 0.5661 < W78 < 0.9087 ∧ W121 > 0.56432
∧ W127 > 0.0369227 => trombone (accuracy 96.7%),

•  W, < 0.00370801 ∧ W28 < 0.00088872 ∧ W77 > 0.04505 ∧ W78 > 0.8276 ∧
W121 < 0.56432 ∧ W122 < 0.380377 ∧ W148 < 5.086e-05 => tuba (accuracy 89.9%),

•  W14 < 0.00288078 ∧ W77 > 0.0457 ∧ W78 > 0.5661 ∧ W103 > 0.00010219 ∧
W121 > 0.56432 ∧ W127 <= 0.0369227 => viola vibrato (accuracy 80.9%),

and so on.
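
Rules of this form are straightforward to apply mechanically. The Python sketch below encodes two of the short rules quoted above as lists of threshold conditions and returns the label of the first rule that fires; the rule-ordering policy, the data structures and the sample values are assumptions made for illustration, not part of C4.5 or DataLogic/R+.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Two of the short rules from the text, as (conditions, class label) pairs.
RULES = [
    ([("W152", ">", 0.0008401)], "trumpet muted"),
    ([("W78", "<", 0.5661), ("W81", ">", 0.400169)], "double bass pizzicato"),
]


def classify(sample, rules=RULES):
    """Return the label of the first rule whose conditions all hold, else None.

    sample maps attribute names such as "W78" to their values.
    """
    for conditions, label in rules:
        if all(OPS[op](sample[attr], threshold) for attr, op, threshold in conditions):
            return label
    return None


# Hypothetical attribute values, chosen only to exercise the rules.
muted = {"W152": 0.001, "W78": 0.7, "W81": 0.1}
plucked = {"W152": 0.0, "W78": 0.3, "W81": 0.5}
unknown = {"W152": 0.0, "W78": 0.9, "W81": 0.5}
```

With these values, `classify(muted)` yields "trumpet muted", `classify(plucked)` yields "double bass pizzicato", and `classify(unknown)` returns `None` because no rule applies.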

Rules extracted using both DataLogic/R+ and C4.5 contain attributes calculated
by means of both filters, from the attack and the steady state of the sound. The
constructed classifiers are based on about 60 attributes.

The obtained trees and rules have been tested using 70% of the data as a
training set and the remaining 30% as a test set. Rough set based experiments
have been performed for various settings of DataLogic/R+ [10], and the best
results have been obtained for the following settings: roughness value 0.01,
rule precision threshold 0.90, i.e. for quite precise rules. Decision trees have
also been created for various settings of C4.5 [9], with quite good results for
standard settings, i.e. with pruning confidence level 25% (and even better
accuracy for settings adjusted individually to each data set). The results are
presented in Tab. 1.

Tab. 1. Percentage of correct classification for the musical instrument sound data

Data                                          Rough sets   Decision trees
18 classes                                    42.75%       57.9%
11 classes                                    46.68%       56.8%
5 classes                                     63.14%       75.8%
3 classes (woodwinds, brass, strings)         69.53%       80.1%
4 classes                                     70.27%       78.6%
3 classes (vibrato, non-vibrato, pizzicato)   71.01%       79.5%

As we see, the results for decision trees are about 10 percentage points better
on average, and, as expected, all results for 3-5 classes are better than for 11
or 18 classes. These results are worse than the 70% accuracy obtained for
different methods of parameterization, i.e. not based on wavelet analysis, by
the author and other researchers [6], [12]. However, this wavelet-based
parameterization is very simple (and can be improved), whereas other
parameterization methods require precise calculation of pitch and very careful
analysis of the spectrum of sounds. Additionally, in real recordings we usually
have a sequence of sounds, i.e. a musical phrase, instead of single sounds.
Therefore, classification of such a phrase can be of higher accuracy. Of course,
additional preprocessing is necessary in real recordings to extract single
sounds of instruments, namely separation of the solo instrument from the musical
background, and separation of consecutive sounds in a phrase.
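
The average advantage of decision trees can be checked directly from the figures in Tab. 1. The short Python computation below only restates the table; the variable names are chosen for this example.

```python
# (rough sets %, decision trees %) per grouping, copied from Tab. 1.
TAB1 = {
    "18 classes": (42.75, 57.9),
    "11 classes": (46.68, 56.8),
    "5 classes": (63.14, 75.8),
    "3 classes (families)": (69.53, 80.1),
    "4 classes": (70.27, 78.6),
    "3 classes (techniques)": (71.01, 79.5),
}

# Difference in percentage points for each grouping, and their mean.
differences = {name: round(tree - rough, 2) for name, (rough, tree) in TAB1.items()}
mean_difference = sum(differences.values()) / len(differences)
```

The differences range from 8.33 to 15.15 percentage points, and their mean is roughly 10.9 percentage points, in line with the "about 10" stated in the text.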


5  Conclusions 


Content-based search of audio and video data is one of the main goals of
multimedia research nowadays. Therefore, automatic classification of musical
sounds is necessary as a tool for labeling sound data. Parameterization of
sounds that allows effective classification of musical instruments is quite
difficult, because the timbre of a sound depends on many circumstances.
Additionally, the timbre changes within the musical scale of an instrument,
which makes classification (which should be correct independently of the pitch)
even more difficult. That is why such a parameterization has to be done very
carefully, involving quite sophisticated parameterization recipes.

The  next  stage  of  this  process  is  classification  of  the  calculated  data.  The  author 
decided  to  use  rough  set  based  algorithms  and  decision  trees,  since  their  outcomes  are 
easy  to  interpret  and  they  identify  the  most  useful  attributes  in  the  proposed 
parameterization.  These  methods  show  the  number  of  necessary  attributes  and  allow 
evaluation  of  their  importance. 

The wavelet-based parameterization described here is very simple, and the
results are somewhat weaker than for more sophisticated methods based on Fourier
analysis and pitch calculation. Unfortunately, precise pitch extraction usually
has to be developed for each instrument independently, and may introduce octave
errors. Wavelet-based parameterization does not require pitch calculation and,
after some improvement, can be quite helpful in classifying musical instrument
sounds.

Abstract. This paper focuses on non-Horn formulas for the class of
regular signed logics, also known as annotated logics. Resolution-based
inference systems for these logics are not new, but most earlier work has
concentrated on Horn formulas, to which the logic programming paradigm
applies. Here a restriction of annotated resolution and reduction
called annotated hyperresolution is introduced. The new rule is developed
for arbitrary CNF formulas of regular signed logics and is shown to be
complete.

Keywords: Logic for AI, ℧-resolution, hyperresolution, inference,
multiple-valued logic, signed and annotated logic

1  Introduction 

Hyperresolution  is  an  example  of  a  theorem  proving  technique  that  employs 
macro  steps.  Such  inference  rules  usually  impose  significant  restrictions  on  what 
choices  are  admissible  in  the  search  for  a  proof.  Another  important  feature  is 
that  only  the  conclusions  of  the  macro  steps  are  retained,  not  the  conclusions 
of  the  constituent  steps.  In  this  paper  hyperresolution  is  extended  to  a  class  of 
multiple  valued  logics  (MVL’s). 

Signed logics [13,4] provide a general¹ framework for reasoning about MVL's.
They evolved from a variety of work on non-standard computational logics,
including [2,4,9,11,15,20]. The key is the attachment of signs (subsets of the
set of truth values) to formulas in the MVL. This approach is appealing because
it facilitates the utilization of classical techniques for the analysis of
non-standard logics, reflecting the classical nature of human reasoning. That
is, regardless of the domain of truth values associated with a logic, at the
meta-level, humans interpret statements about the logic to be either true or
false.

This paper focuses on the class of regular signed logics. These logics are of
interest in the knowledge representation and logic programming communities
because they correspond to the class of paraconsistent logics known as annotated
logics, introduced by Subrahmanian [19], Blair and Subrahmanian [2], and
Kifer et al. [10,21]. In [13], they were also shown to capture fuzzy logics, but
in this paper, regular signed logics will refer to annotated logics. In most of
the work on annotated logics, the focus has been on Horn sets, widely applied
within logic programming. The inference rule annotated hyperresolution is
developed in this paper for arbitrary regular signed formulas in conjunctive
normal form (CNF). Establishing completeness involves substantive reformulations
of techniques developed for the classical counterparts (see [1,8]).

*  This research was supported in part by the National Science Foundation under
grant CCR-9731893.

¹  Hähnle, R. and Escalada-Imaz, G. [7] have an excellent survey encompassing
deductive techniques for a wide class of MVL's, including (properly) signed
logics.

Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 301-310, 2000.
(c) Springer-Verlag Berlin Heidelberg 2000

J.J. Lu, N.V. Murray, and E. Rosenthal

There has been related work by Sofronie-Stokkermans on adapting hyperresolution
to logics with truth value sets based on finite distributive lattices [17,18].
Under these assumptions, signs can be restricted to prime filters and their
complements. This eliminates the need to resolve more than one positive literal
against one negative literal. A similar situation occurs with regular signs when
the truth value set is linearly ordered. Hähnle has exploited this to obtain
resolution refinements in [5] and to introduce a version of hyperresolution
under these conditions [6].

The  next  section  is  a  summary  of  the  basic  ideas  of  signed  formulas  and 
annotated  logics.  Theorems  1-4  in  Section  2.3  were  proved  in  [12].  The  main 
results  are  found  in  Section  3:  The  pure  rule  is  adapted  to  signed  and  annotated 
logics  in  Section  3.1;  annotated  hyperresolution  is  developed  in  Section  3.2. 

2  Signed  Logics 

Detailed  descriptions  of  the  basics  of  signed  logics  can  be  found  in  [15]  and 
in  [13];  the  presentation  in  this  section  is  brief. 

Given a language $\Lambda$, let $\Delta$ be a complete lattice of truth values
under some ordering $\preceq$.²

A sign is a subset of $\Delta$, and a signed formula is an expression of the
form $S : \mathcal{F}$, where $S$ is a sign and $\mathcal{F}$ is a formula in
$\Lambda$.

To answer arbitrary queries, we represent queries about formulas in $\Lambda$ by
formulas in a classical logic $\Lambda_S$, the language of signed formulas; it
is defined as follows: The literals are signed formulas and the connectives are
(classical) conjunction and disjunction. It should be emphasized that a signed
formula $S : \mathcal{F}$ is a literal in $\Lambda_S$ regardless of the size or
complexity of $\mathcal{F}$ and thus has no component parts in the language
$\Lambda_S$. The set of truth values is {true, false}. A formula in $\Lambda_S$
is defined to be $\Lambda$-atomic if whenever $S : A$ is a literal in the
formula, then $A$ is an atom in $\Lambda$.

An arbitrary interpretation for $\Lambda_S$ may make an assignment of true or
false to any signed formula (i.e., to any literal) in the usual way. To focus
attention only on those interpretations that relate to the sign in a signed
formula, we restrict attention to $\Lambda$-consistent interpretations. An
interpretation $I$ over $\Lambda$ assigns to each literal, and therefore to each
formula $\mathcal{F}$, a truth value in $\Delta$, and the corresponding
$\Lambda$-consistent interpretation $I_c$ is defined by
$I_c(S : \mathcal{F}) = \mathit{true}$ if $I(\mathcal{F}) \in S$;
$I_c(S : \mathcal{F}) = \mathit{false}$ if $I(\mathcal{F}) \notin S$.

²  As usual, the greatest and least elements of $\Delta$ are denoted $\top$ and
$\bot$, respectively, and Sup and Inf denote, respectively, the supremum (least
upper bound) and infimum (greatest lower bound) of a subset of $\Delta$.

The  annotated  hyperresolution  rule  developed  in  this  paper  lifts  in  the  usual 
way;  attention  is  mostly  restricted  to  the  ground  case  in  this  paper. 

2.1  Signed  Resolution 

In this section, we review a method for adapting resolution to signed formulas.
The inference rules ℧-resolution, introduced in [13], and annotated
hyperresolution, defined later, are based on a generalized notion of
complementary literals, which is characterized in the next lemma.

Lemma 1. (The Reduction Lemma) Let $S_1 : A$ and $S_2 : A$ be $\Lambda$-atomic
atoms in $\Lambda_S$; then $S_1 : A \wedge S_2 : A \equiv_\Lambda (S_1 \cap S_2) : A$
and $S_1 : A \vee S_2 : A \equiv_\Lambda (S_1 \cup S_2) : A$. □

Consider a $\Lambda$-atomic formula $\mathcal{F}$ in $\Lambda_S$ in conjunctive
normal form (CNF). Let $C_j$, $1 \le j \le r$, be clauses in $\mathcal{F}$ that
contain, respectively, $\Lambda$-atomic literals $\{S_j : A\}$. Thus we may
write $C_j = K_j \vee \{S_j : A\}$. Then the resolvent $R$ of the $C_j$'s is
defined to be the clause

$$\left( \bigvee_{j=1}^{r} K_j \right) \vee \left( \left( \bigcap_{j=1}^{r} S_j \right) : A \right)$$

The rightmost disjunct is called the residue of the resolution; observe that it
is unsatisfiable if its sign is empty and satisfiable if it is not. In the
former case, it may simply be deleted from $R$.
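
The construction of the resolvent and its residue can be sketched in Python as follows. The representation is an assumption of this example: a clause is a dictionary from atoms to signs (frozensets of truth values), so that several literals on the same atom within one clause are already merged into a single sign, which Lemma 1 justifies. The function name `signed_resolvent` is hypothetical.

```python
from functools import reduce


def signed_resolvent(clauses, atom):
    """Resolve the given clauses on `atom`.

    Each clause maps an atom to its sign (a frozenset of truth values) and
    must contain `atom`. The result is the disjunction of the remaining
    literals plus the residue (the intersection of the signs on `atom`),
    which is dropped when its sign is empty.
    """
    resolvent = {}
    for clause in clauses:
        for other, sign in clause.items():
            if other != atom:
                # Merge literals on the same atom by taking the union of signs.
                resolvent[other] = resolvent.get(other, frozenset()) | sign
    residue = reduce(frozenset.intersection, (clause[atom] for clause in clauses))
    if residue:
        resolvent[atom] = residue
    return resolvent


# Hypothetical truth values 0, 0.5 and 1.
c1 = {"A": frozenset({0.5, 1}), "B": frozenset({1})}
c2 = {"A": frozenset({0})}
c3 = {"A": frozenset({0, 0.5}), "C": frozenset({0})}
```

Resolving `c1` with `c2` on `A` gives an empty residue, so only `{1} : B` remains; resolving `c1` with `c3` keeps the residue `{0.5} : A`. An empty dictionary stands for the empty clause.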

In clausal resolution systems, merging is crucial. (Consider unsatisfiable
clause sets for which the minimal clause size is two.) In this paper, we treat
clauses as sets, and merging is assumed.

Observe that if $S \subseteq S'$, and if two clauses are resolved on the
literals $S : A$ and $S' : A$, then the residue will be $S : A$ (after all,
$S \cap S' = S$), so the clause containing $S : A$ must entail the resolvent.
This proves

Lemma  2.  The  resolvent  produced  by  resolving  on  two  literals  in  which  the 
sign  of  one  contains  the  sign  of  the  other  is  entailed  by  one  of  its  parents.  □ 

2.2  Regular  Signed  Formulas  and  Annotated  Logics 

Let $(P; \preceq)$ be any partially ordered set, and let $Q \subseteq P$. Then
$\uparrow Q = \{y \in P \mid (\exists x \in Q)\; x \preceq y\}$. Note that
$\uparrow Q$ is the smallest upset containing $Q$ (see [3]). If $Q$ is a
singleton set $\{x\}$, then we simply write $\uparrow x$. We say that a subset
$Q$ of $P$ is regular if for some $x \in P$, $Q = \uparrow x$ or
$Q = (\uparrow x)'$ (the set complement of $\uparrow x$). We call $x$ the
defining element of the set. In the former case, we call $Q$ positive, and in
the latter negative. Observe that both $\Delta$ and $\emptyset$ are regular
since $\Delta = \uparrow\bot$ and $\emptyset = \Delta'$. Observe also that if
$z = \mathrm{Sup}\{x, y\}$, then $\uparrow x \cap \uparrow y = \uparrow z$. A
signed formula is regular if every sign that occurs in it is regular. Note that
we may assume that no regular signed formulas have any signs of the form
$(\uparrow\bot)'$.
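
These definitions can be checked on a small example. The Python sketch below uses, as an assumed truth value lattice, the four subsets of {a, b} ordered by inclusion (so that Sup is set union and the least element is the empty set); the lattice and the function names are illustrative choices, not taken from the paper.

```python
from itertools import product

# Assumed example lattice: subsets of {"a", "b"} ordered by inclusion.
DELTA = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]
BOTTOM = frozenset()


def up(x):
    """The upset of x: all truth values y with x <= y."""
    return frozenset(y for y in DELTA if x <= y)


def regular_signs():
    """All regular signs: upsets of single elements and their complements."""
    full = frozenset(DELTA)
    signs = set()
    for x in DELTA:
        signs.add(up(x))
        signs.add(full - up(x))
    return signs


# The intersection of two upsets is the upset of the supremum (here: union).
identity_holds = all(
    up(x) & up(y) == up(x | y) for x, y in product(DELTA, repeat=2)
)
```

In this lattice the whole set of truth values equals `up(BOTTOM)`, the empty set appears as its complement, and `identity_holds` confirms the observation about suprema for every pair of elements.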

An  annotated  logic  is  a  signed  logic  in  which  only  regular  signs  are  allowed. 

A regular sign is completely characterized by its defining element, say $x$, and
its polarity (whether it is positive or negative). A regular signed atom may be
written $\uparrow x : A$, while its complement is $(\uparrow x)' : A$. Observe
that $(\uparrow x)' : A = \sim(\uparrow x : A)$; that is, the signed atoms are
complementary with respect to $\Lambda$-consistent interpretations. With
annotated logics, the most common notation is $\mathcal{F} : x$ and
$\sim\mathcal{F} : x$. There is no particular advantage of one or the other, and
it is perhaps unfortunate that both have arisen. We will follow the
$x : \mathcal{F}$ convention when dealing with signed logics and use
$\mathcal{F} : x$ for annotated logics.


2.3  Signed  Resolution  for  Annotated  Logics 

A sound and complete resolution proof procedure was defined for clausal
annotated logics in [9]. The procedure contains two inference rules that we will
refer to as annotated resolution and reduction.³ These two inference rules
correspond to disjoint instances of signed resolution. Two annotated literals
$L_1$ and $L_2$ are said to be complementary if they have the respective forms
$A : \mu$ and $\sim(A : \rho)$, where $\mu \ge \rho$, and annotated resolution
is defined as follows: Given the annotated clauses $(L_1 \vee D_1)$ and
$(L_2 \vee D_2)$, where $L_1$ and $L_2$ are complementary, the annotated
resolvent of the two clauses on the annotated literals $L_1$ and $L_2$ is
$D_1 \vee D_2$.

Two clauses can be so resolved only if the annotation of the positive annotated
literal that is resolved upon is greater than or equal to the annotation of the
negative literal resolved upon. In that case the two clauses are said to be
resolvable on the annotated literals $L_1$ and $L_2$.

The reduction rule is defined when two occurrences of an atom have positive
signs. Suppose $(A : \mu_1 \vee E_1)$ and $(A : \mu_2 \vee E_2)$ are two annotated clauses in which
$\mu_1$ and $\mu_2$ are incomparable. Then the annotated clause
$(A : \mathrm{Sup}\{\mu_1, \mu_2\}) \vee E_1 \vee E_2$ is called a reductant of the two clauses, and we say
that the two clauses are reducible on the annotated literals $A : \mu_1$ and $A : \mu_2$. A
reduction step may be required to produce a positive sign that in turn enables an
annotated resolution step.
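
The two rules can be illustrated with a minimal Python sketch. It is not taken from the paper: the four-element lattice FOUR (bot below the incomparable t and f, both below top), the tuple encoding of literals, and the function names are all assumptions made for illustration. The example shows a case where neither positive literal alone resolves against the negative literal, but their reductant does.

```python
# Assumed lattice FOUR: bot < t < top and bot < f < top, with t and f incomparable.
ELEMS = ["bot", "t", "f", "top"]
LEQ = ({(a, a) for a in ELEMS}
       | {("bot", x) for x in ELEMS}
       | {(x, "top") for x in ELEMS})

def leq(a, b):
    return (a, b) in LEQ

def sup(a, b):
    """Least upper bound of a and b."""
    uppers = [u for u in ELEMS if leq(a, u) and leq(b, u)]
    return next(u for u in uppers if all(leq(u, w) for w in uppers))

# A literal is (is_positive, atom, annotation); a clause is a frozenset of literals.
def pos(atom, ann):
    return (True, atom, ann)

def neg(atom, ann):
    return (False, atom, ann)

def annotated_resolvent(c1, lit1, c2, lit2):
    """Resolve positive lit1 = A:mu in c1 against negative lit2 = ~A:rho in c2.

    Returns None unless mu >= rho, i.e. unless the literals are complementary.
    """
    (p1, a1, mu), (p2, a2, rho) = lit1, lit2
    if p1 and not p2 and a1 == a2 and leq(rho, mu):
        return (c1 - {lit1}) | (c2 - {lit2})
    return None

def reductant(c1, lit1, c2, lit2):
    """Merge two positive literals on one atom whose annotations are incomparable."""
    (p1, a1, m1), (p2, a2, m2) = lit1, lit2
    if p1 and p2 and a1 == a2 and not leq(m1, m2) and not leq(m2, m1):
        return (c1 - {lit1}) | (c2 - {lit2}) | {pos(a1, sup(m1, m2))}
    return None

# Clauses (A:t v E1:t), (A:f v E2:t) and the unit clause (~A:top).
c1 = frozenset({pos("A", "t"), pos("E1", "t")})
c2 = frozenset({pos("A", "f"), pos("E2", "t")})
c3 = frozenset({neg("A", "top")})

direct = annotated_resolvent(c1, pos("A", "t"), c3, neg("A", "top"))
red = reductant(c1, pos("A", "t"), c2, pos("A", "f"))
final = annotated_resolvent(red, pos("A", "top"), c3, neg("A", "top"))
print(direct, sorted(red), sorted(final))
```

Here `direct` is `None` because t is not above top, while the reductant carries A:top and therefore resolves, leaving only the two side literals.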

It is straightforward to see that the two inference rules are both captured
by signed resolution. In particular, annotated resolution corresponds to an
application of signed resolution (to regular signed clauses) in which the signs of
the selected literals are disjoint. Reduction, on the other hand, corresponds to
an application of signed resolution in which the signs of the selected literals are
both positive and thus have a non-empty regular intersection.

Theorem 1. Suppose that $\mathcal{F}$ is a set of annotated clauses and that $\mathcal{D}$ is a
deduction of $\mathcal{F}$ using annotated resolution and reduction. Then $\mathcal{D}$ is a signed
deduction of $\mathcal{F}$. In particular, if $\mathcal{F}$ is an unsatisfiable set of first order annotated
clauses, then there is a signed refutation of $\mathcal{F}$. □

3 Kifer and Lozinskii refer to their first inference rule simply as resolution.
However, since we are working with several resolution rules in this paper, appropriate
adjectives will be used to avoid ambiguity.

The  viewpoint  of  signed  logics  provides  insight  into  annotated  logics.  At  the 
same  time,  the  restriction  to  regular  signs  has  practical  advantages.  The  next 
theorem,  which  will  be  quite  useful  in  Section  3,  is  an  example. 

Theorem 2. Suppose $S_1, \ldots, S_n$ are regular signs whose intersection is empty,
and suppose that no proper subset of $\{S_1, \ldots, S_n\}$ has an empty intersection.
Then exactly one sign is negative; i.e., for some $j$, $1 \le j \le n$, $S_j = (\uparrow x_j)'$, and
for $i \ne j$, $S_i = \uparrow x_i$, where $x_1, \ldots, x_n \in \Delta$. □

The  intersection  of  a  positive  regular  sign  and  a  negative  regular  sign  is 
regular  if  and  only  if  it  is  empty,  and  two  negative  signs  can  have  a  regular 
intersection  if  and  only  if  one  is  a  subset  of  the  other.  In  view  of  Lemma  2,  the 
latter  situation  need  never  be  considered,  so  a  signed  deduction  is  defined  to  be 
regular  if  every  sign  that  appears  in  the  deduction  is  regular  and  if  no  residue 
sign  is  produced  by  the  intersection  of  two  negative  signs.  The  next  two  theorems 
are  immediate.  Theorem  4  states  that  the  class  of  regular  signed  deductions  is 
precisely  the  class  of  deductions  using  annotated  resolution  and  reduction.  As 
a result, restricting signed resolution to regular clauses captures annotated
resolution and reduction without increasing the search space. Deductions obeying
this  restriction  are  called  regular. 

Theorem  3.  A  signed  deduction  of  a  regular  formula  is  regular  if  and  only  if 
the  sign  of  every  satisfiable  residue  is  produced  by  the  intersection  of  two  positive 
regular  signs.  □ 

Theorem 4. Let $\mathcal{D}$ be a sequence of annotated clauses. Then $\mathcal{D}$ is an annotated
deduction if and only if $\mathcal{D}$ is a regular signed deduction. □

It follows from the theorem that regular signed resolution is complete.

Corollary. Suppose $\mathcal{F}$ is an unsatisfiable set of regular signed clauses. Then
there is a regular signed deduction of the empty clause from $\mathcal{F}$. □

3  Regular Signed Deduction with Non-Horn Sets

The ideas described in Section 2.3 were employed in [12] to develop ℧-resolution
for annotated logic programs. One nice feature of ℧-resolution is that it allows
simple SLD-style proof procedures for annotated logic programs over a large
class of lattices. It does so by eliminating the expensive reduction rule, yet it does
not require irregular deductions. Moreover, for any deduction using annotated
resolution and reduction, there is a corresponding deduction using ℧-resolution
that is at least as short.

These  advantages  apply  to  the  logic  programming  paradigm.  In  this  paper, 
the  more  general  non-Horn  setting  is  addressed.  Every  deduced  clause  is  cached, 
subject  perhaps  to  certain  deletion  strategies.  We  begin  by  adapting  the  notion 
of purity from classical logic to signed and annotated logics.


3.1  Purity

Adapting  the  notion  of  purity  to  signed  and  annotated  logics  is  not  completely 
straightforward.  In  classical  logic,  a  literal  in  a  set  of  clauses  is  said  to  be  pure 
if  its  complement  does  not  occur  in  any  other  clause.  In  that  case,  the  clause 
containing  the  pure  literal  is  also  said  to  be  pure.4 

The literal set (conjunction) $L = \{S_1 : A, S_2 : A, \ldots, S_m : A\}$ is unsatisfiable
if $\bigcap_{i=1}^{m} S_i = \emptyset$; $L$ is minimally unsatisfiable if the removal of any literal from
$L$ produces a satisfiable set. A literal in a set $\mathcal{S}$ of clauses is pure if it does
not belong to any minimally unsatisfiable set of literals in which distinct literals
occur in distinct clauses. In that case, the clause containing the pure literal is
also said to be pure.

Observe that it is necessary to include minimally unsatisfiable in the
definition: It is possible for $\{l\} \cup L$ to be unsatisfiable but for $l$ not to be in any
minimally unsatisfiable subset. A trivial example is for $L$ to be any unsatisfiable
literal set and for $l$ to be $\Delta : A$. (In essence, $l$ is the constant true.) Looked at
another way, the definition assures that if $l$ is not pure, its removal makes some
unsatisfiable literal set in which it resides satisfiable.

Lemma 3 (Signed Pure Rule). Let $\mathcal{S}$ be a set of signed clauses in which
the clause $C$ contains the pure literal $S : A$. Then $\mathcal{S}$ is unsatisfiable if and only if
$\mathcal{S}' = \mathcal{S} - \{C\}$ is unsatisfiable. □

For annotated logics, the literal set $L = \{A : x_1, A : x_2, \ldots, A : x_m,
\sim A : x_{m+1}, \ldots, \sim A : x_{m+r}\}$ is unsatisfiable if $m, r > 0$ and, for some $j$,
$m+1 \le j \le m+r$, $\mathrm{Sup}\{x_1, \ldots, x_m\} \ge x_j$. Then $L$ is minimally unsatisfiable
if the removal of any member of $L$ results in a satisfiable literal set. In light of
Theorem 2, $r = 1$ for any minimally unsatisfiable annotated literal set.

A  literal  in  a  set  S  of  annotated  clauses  is  said  to  be  pure  if  it  does  not  belong 
to  any  minimally  unsatisfiable  set  of  literals  in  which  distinct  literals  occur  in 
distinct  clauses.  In  that  case,  the  clause  containing  the  pure  literal  is  also  said 
to  be  pure. 

Lemma 4 (Annotated Pure Rule). Let $\mathcal{S}$ be a set of annotated clauses in
which the clause $C$ contains the pure literal $l$. Then $\mathcal{S}$ is unsatisfiable if and only
if $\mathcal{S}' = \mathcal{S} - \{C\}$ is unsatisfiable. □

The next lemma is useful in proving the completeness of annotated
hyperresolution (Section 3.2).

Lemma 5. Let $\mathcal{S} = \{C_0, C_1, C_2, \ldots, C_k\}$ be a minimally unsatisfiable set of
annotated clauses (i.e., no proper subset of $\mathcal{S}$ is unsatisfiable), and suppose
$C_0 = \{l\} \cup \{l_1, l_2, \ldots, l_n\}$, $n \ge 0$. Obtain $\mathcal{S}'$ from $\mathcal{S}$ by deleting every occurrence
of $l$ in $\mathcal{S}$. Then $\mathcal{S}'$ is unsatisfiable, and every minimally unsatisfiable subset of
$\mathcal{S}'$ contains $C_0' = \{l_1, \ldots, l_n\}$. □

4  The  pure  rule  states  that  a  set  of  clauses  is  unsatisfiable  iff  the  set  with  all  pure 
clauses  removed  is  unsatisfiable.  This  rule  does  extend  to  signed  logics,  but  the 
definition  of  purity  must  be  properly  formulated:  A  literal  in  a  clause  might  not  be 
pure even though no other clause contains a complementary literal.


3.2  Annotated Hyperresolution

Recall that annotated resolution consists of two inference rules, annotated
resolution and reduction. Both of these are special cases of signed resolution
restricted to regular signs. The annotated hyperresolution rule defined below can be
thought of as a single rule that executes several regular signed resolution steps
at once.

Let $\mathcal{S}$ be a set of annotated clauses. Clauses are defined to be positive or
negative as they are for hyperresolution in classical logic.5 A clause is positive
if it does not contain any negative literals, and it is negative if it contains at
least one negative literal. Note that if $\mathcal{S}$ is unsatisfiable, it must contain at least
one clause with only negative literals and at least one clause with only positive
literals.

Nucleus and satellite clauses are required to define annotated
hyperresolution, as they are in the classical case. However, the definitions are a bit more
complicated for the annotated case. Let

$$N = \Big(\bigcup_{k=1}^{n} \{\sim b_k : \rho_k\}\Big) \cup C$$

be a negative clause in $\mathcal{S}$, where $C$ is positive, and select sets of positive clauses
$B_1, \ldots, B_n$ as follows.

For $1 \le k \le n$, $B_k = \{B_{k1}, B_{k2}, \ldots, B_{kn_k}\}$, where $b_k : \beta_{kl} \in B_{kl}$ and
$\rho_k \le \mathrm{Sup}\{\beta_{k1}, \ldots, \beta_{kn_k}\}$; let $\beta_k = \mathrm{Sup}\{\beta_{k1}, \ldots, \beta_{kn_k}\}$. Intuitively, for each
$\sim b_k : \rho_k \in N$, $B_k$ consists of $n_k$ positive clauses, and each clause contains
an annotated atom of the form $b_k : \beta_{kl}$. Furthermore, the members of $B_k$ can be
used to form a reductant $B_k^R$ containing the annotated atom $b_k : \beta_k$, and this
reductant resolves against the literal $\sim b_k : \rho_k$ in the nucleus clause.

It turns out that the following additional condition, which further restricts
the search, is useful: For any $b' : \beta' \in B_{kl} - \{b_k : \beta_{kl}\}$, $b' : \beta' \ne b_j : \beta_{jl}$,
$1 \le j \le n$, $1 \le l \le n_j$; that is, $b' : \beta'$ is not one of the atoms contributing to the
residue of some $B_j^R$. (Intuitively, all such atoms are ideally resolved away by
$\sim b_k : \rho_k$, and we prohibit them from being reintroduced by other satellites.) This
condition is referred to as the satellite redundancy condition in the proof of Theorem 6.

Then the clause

$$R = \Big(\bigcup_{k=1}^{n} \big(B_k^R - \{b_k : \beta_k\}\big)\Big) \cup C$$

is the annotated hyperresolvent of the nucleus clause $N$ and the satellite clauses
$B_{11}, \ldots, B_{nn_n}$.

Obviously, $R$ is a positive clause. That $R$ can be soundly inferred from $N$
and the $B_{kl}$'s can easily be seen by noting that a sequence of binary annotated
resolutions between $N$ and the $B_k^R$'s produces $R$ (and the $B_k^R$'s are the result of
a sequence of reductions). Semantically, any interpretation $I$ that satisfies the
parent clauses either satisfies one literal in $C$ (and thus $R$) or satisfies some
$\sim b_k : \rho_k$. But then $I$ falsifies $b_k : \rho_k$ and thus falsifies $b_k : \beta_k$ as well. Hence,
some $b_k : \beta_{kj}$ is falsified in $B_{kj}$. Since $B_{kj}$ is satisfied by $I$, so is some literal in
$B_{kj} - \{b_k : \beta_{kj}\}$, i.e., some literal in $R$. This proves

Theorem 5. Annotated hyperresolution is a sound rule of inference for
annotated logic. □

5 This is Robinson's original terminology [16]; others have used mixed and negative
to describe non-positive clauses.

Consider an example over the lattice SIX, which is a six element lattice
containing two chains: $\bot < lt < t < \top$ and $\bot < lf < f < \top$.

Suppose we have the following unsatisfiable set of annotated clauses:
(1) $\{p : t, \sim q : \top, \sim r : t, \sim s : f\}$, (2) $\{q : lt, p : f\}$, (3) $\{q : lf\}$, (4) $\{r : t\}$,
(5) $\{s : t, p : t\}$, (6) $\{s : lf\}$, (7) $\{\sim p : lt\}$, and (8) $\{\sim p : lf\}$.

An annotated hyperresolution proof is obtained by using clause (1) as nucleus
for satellites (2) through (6) to produce $p : t \vee p : f$. It then serves as the only
required satellite in the remainder of the deduction, in which clauses (7) and (8)
serve as nuclei.
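
The example can be replayed with a short Python sketch. The encoding below (literals as tuples, the names `hyperresolve` and `satisfiable`, and the reading of clause (3) as q:lf) is an assumption made for illustration, and the satellite redundancy condition is not checked. The sketch confirms by brute force over all 6^4 interpretations that the clause set is unsatisfiable, computes the first hyperresolvent, and shows one possible way of finishing the refutation with clauses (7) and (8).

```python
from itertools import product

# Lattice SIX: bot < lt < t < top and bot < lf < f < top.
ELEMS = ["bot", "lt", "t", "lf", "f", "top"]
COVERS = [("bot", "lt"), ("lt", "t"), ("t", "top"),
          ("bot", "lf"), ("lf", "f"), ("f", "top")]

def build_leq():
    """Reflexive-transitive closure of the covering relation."""
    leq = {(a, a) for a in ELEMS} | set(COVERS)
    while True:
        new = {(a, d) for (a, b) in leq for (c, d) in leq if b == c} - leq
        if not new:
            return leq
        leq |= new

LEQ = build_leq()

def sup(values):
    uppers = [u for u in ELEMS if all((v, u) in LEQ for v in values)]
    return next(u for u in uppers if all((u, w) in LEQ for w in uppers))

# Literal: (is_positive, atom, annotation).
C = [
    [(True, "p", "t"), (False, "q", "top"), (False, "r", "t"), (False, "s", "f")],
    [(True, "q", "lt"), (True, "p", "f")],
    [(True, "q", "lf")],
    [(True, "r", "t")],
    [(True, "s", "t"), (True, "p", "t")],
    [(True, "s", "lf")],
    [(False, "p", "lt")],
    [(False, "p", "lf")],
]

def holds(lit, interp):
    positive, atom, ann = lit
    return ((ann, interp[atom]) in LEQ) == positive

def satisfiable(clauses):
    atoms = sorted({a for c in clauses for (_, a, _) in c})
    for vals in product(ELEMS, repeat=len(atoms)):
        interp = dict(zip(atoms, vals))
        if all(any(holds(l, interp) for l in c) for c in clauses):
            return True
    return False

def hyperresolve(nucleus, satellites):
    """satellites maps each negative nucleus literal to (clause, literal) pairs."""
    resolvent = {l for l in nucleus if l[0]}
    for nlit in (l for l in nucleus if not l[0]):
        picks = satellites[nlit]
        beta = sup([lit[2] for _, lit in picks])
        assert (nlit[2], beta) in LEQ          # rho_k <= beta_k
        for clause, lit in picks:
            resolvent |= {l for l in clause if l != lit}
    return resolvent

step1 = hyperresolve(C[0], {
    C[0][1]: [(C[1], C[1][0]), (C[2], C[2][0])],   # q:lt and q:lf, Sup = top
    C[0][2]: [(C[3], C[3][0])],                    # r:t
    C[0][3]: [(C[4], C[4][0]), (C[5], C[5][0])],   # s:t and s:lf, Sup = top
})
step2 = hyperresolve(C[6], {C[6][0]: [(list(step1), (True, "p", "t"))]})
step3 = hyperresolve(C[7], {C[7][0]: [(list(step2), (True, "p", "f"))]})
print(sorted(step1), sorted(step2), step3, satisfiable(C))
```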

Annotated  hyperresolution  is  complete;  the  proof  is  not  trivial. 

Theorem 6. Annotated hyperresolution is refutation complete for
propositional annotated logic.

Proof. Let $\mathcal{S} = \{C_1, C_2, \ldots, C_m\}$ be an unsatisfiable set of annotated clauses.
Assume that $\mathcal{S}$ is minimally unsatisfiable; otherwise, restrict attention to a
minimally unsatisfiable subset. We must show that there is a refutation of $\mathcal{S}$ using
annotated hyperresolution.

Proceed by induction on the number $n$ of literal occurrences in $\mathcal{S}$. If there
are none, then $\mathcal{S}$ is $\{\emptyset\}$, and we are done. Note that $\mathcal{S}$ cannot contain exactly
one literal occurrence.

So  suppose  that  all  minimally  unsatisfiable  annotated  clause  sets  with  at 
most  n  literal  occurrences  can  be  refuted  with  annotated  hyperresolution,  and 
assume  that  S  has  size  n  +  1.  Let  q  be  a  predicate  occurring  in  S.  Note  that 
since  S  is  minimal,  the  Pure  Rule  implies  that  q  cannot  be  pure,  so  there  are 
minimally  unsatisfiable  literal  sets,  whose  members  have  q  as  the  predicate  and 
are  taken  from  distinct  clauses. 

Consider the set $\{q : x_1, q : x_2, \ldots, q : x_{r^+}\}$ of all positive $q$-literals in $\mathcal{S}$. We
shall first show that for each $i$, $1 \le i \le r^+$, the unit clause $\{q : x_i\}$ can be derived
by annotated hyperresolution. To do so, remove all occurrences of $q : x_i$ from $\mathcal{S}$
to produce $\mathcal{S}_i'$. This formula is unsatisfiable by Lemma 5. Consider a minimally
unsatisfiable subset; by the same lemma, every clause that came from a clause
in $\mathcal{S}$ containing $q : x_i$ is in this set. Since the number of literals in $\mathcal{S}_i'$ is at most
$n$, the induction hypothesis applies, and there is a refutation $\mathcal{R}_{q_i}$ by annotated
hyperresolution.

Now apply that refutation to $\mathcal{S}$; that is, construct a deduction in $\mathcal{S}$ by doing
the identical annotated hyperresolution steps with the exception that each
deleted $q : x_i$ is included. Observe that the effect is that whenever a clause, whether
nucleus or a satellite, contains $q : x_i$, that literal is added to the resolvent. Note
also that the satellite redundancy condition is obeyed: Satellites do not
reintroduce any of the positive literals that, collectively, resolve against the negative
nucleus literal. The deduction in $\mathcal{S}_i'$ has this property by the induction
hypothesis, and $q : x_i$ does not occur in $\mathcal{S}_i'$.

Call the resulting deduction (it may no longer be a refutation) $\mathcal{R}'_{q_i}$, which,
with merging, may produce the unit clause $\{q : x_i\}$ rather than the empty clause.
Nevertheless, each step is an annotated hyperresolution step. The reason is that
by reintroducing positive occurrences of $q : x_i$, the status of all clauses as nucleus
or as satellite in an inference step is unchanged. Thus each of the $r^+$ unit clauses
$\{q : x_i\}$, $1 \le i \le r^+$, may be derived with annotated hyperresolution.

Consider the negative occurrences $\sim q : y_j$, $1 \le j \le r^-$, of $q$ in $\mathcal{S}$. Note that,
since none is pure, $\mathrm{Sup}\{x_1, \ldots, x_{r^+}\} \ge y_j$, $1 \le j \le r^-$. (Otherwise, $\sim q : y_j$
would not be in an unsatisfiable literal set.) Thus some subset of the positive
units $\{q : x_1, \ldots, q : x_{r^+}\}$ will suffice to resolve away any particular negative
literal containing $q$.

Now delete from $\mathcal{S}$ all occurrences of $\sim q : y_j$ for some $j$. The resulting
formula is unsatisfiable, and a refutation $\mathcal{R}_{\sim q_j}$ by annotated hyperresolution
can be found. Let the proof that results from applying $\mathcal{R}_{\sim q_j}$ to $\mathcal{S}$ be denoted by
$\mathcal{R}'_{\sim q_j}$. This proof yields either the empty clause or the unit $\{\sim q : y_j\}$. However,
it may fail to be an annotated hyperresolution proof in two ways. Reintroducing
$\sim q : y_j$ into a clause may add to the negative literals in a nucleus clause or
may convert a positive (satellite) clause into a negative clause. In the first case,
the nucleus would have a negative literal not resolved away, and in the second
case, a clause that cannot act as a satellite would be produced. In either case, an
annotated hyperresolution proof can be constructed from $\mathcal{R}'_{\sim q_j}$ using the units
$\{q : x_i\}$ produced by the deductions $\mathcal{R}'_{q_i}$, as we show below.

Suppose first that a step in $\mathcal{R}_{\sim q_j}$ with nucleus $N = \{\sim b_1 : y_1, \ldots, \sim b_n : y_n,
c_1, \ldots, c_m\}$ adds $\sim q : y_j$ to $N$. As noted earlier, some subset of the derived
unit clauses $\{q : x_i\}$ resolves away $\sim q : y_j$. Employing these units as additional
satellites results in an annotated hyperresolution step in which the deduced
(positive) clause is exactly the same as in $\mathcal{R}_{\sim q_j}$.

Suppose now that $\sim q : y_j$ is added to a satellite clause $B$, producing a
negative clause $B'$ with one negative literal. If $B'$ is used as a nucleus clause,
and if the derived unit clauses that resolve away $\sim q : y_j$ are used as satellite
clauses, then $B$ is the annotated hyperresolvent. Note that this construction
assures that the last step that produced $\sim q : y_j$ in $\mathcal{R}'_{\sim q_j}$ now produces the
empty clause.

Again the satellite redundancy condition is obeyed: in $\mathcal{R}_{\sim q_j}$ by the induction
hypothesis, and in $\mathcal{R}'_{\sim q_j}$ since the only changes introduced involve additional
collections of unit satellites.

Finally, combining the deductions $\mathcal{R}'_{q_i}$ and the modified deduction $\mathcal{R}'_{\sim q_j}$
produces the required annotated hyperresolution refutation. □


Abstract. Research on the axioms of Kleene algebra is surveyed, and the
fundamental properties of these axioms are clarified by the method of indeterminate
coefficients. In particular, an algorithm is shown for checking whether a given
axiom of Kleene algebra is independent of the other axioms. Finally, all finite
models of Kleene algebra with 8 elements are derived as an example.


1  Introduction 

Kleene algebra was first proposed and investigated by J. A. Kalmann(1) under the
name of “a normal i-lattice”, and the book of R. Balbes and P. Dwinger(2) states
that the name “Kleene algebra” can be found in the paper of D. Brignole and
A. Monteiro(3).

On the other hand, after fuzzy logic was proposed by L. A. Zadeh(4) to treat
ambiguous states or phenomena that exist in the real world, many researchers
investigated the algebraic structures of fuzzy logic. Almost all of them characterized
fuzzy logic as a De Morgan algebra. Among them, only F. P. Preparata and R. T. Yeh(5)
and M. Mukaidono(6) pointed out that Kleene's law (~a ∨ a) ≥ (~b • b) holds in fuzzy
logic. In this situation, M. Mukaidono(7) was the first to state that fuzzy logic is a
model of “Kleene algebra”; he found a set of independent and complete axioms for
Kleene algebra, and afterwards he clarified the canonical forms of free Kleene algebra
under the name of fuzzy switching functions(8)~(10) (in the literature this algebra is
sometimes also called “fuzzy algebra” or “soft algebra”). Kleene algebra is weaker
than Boolean algebra and stronger than De Morgan algebra, because a Kleene algebra
is a De Morgan algebra satisfying Kleene's laws, and Kleene's laws are a weaker
version of the law of excluded middle, which is the essential law of Boolean
algebra. It has already been shown that Kleene algebra is essentially 3-valued(2),(10).

Recently, Kleene algebra has appeared in many fields and plays an essential role in
representing ambiguous or uncertain states, especially in the field of intelligent
systems. In this paper, research on the axioms of Kleene algebra is surveyed and the
fundamental properties of these axioms are clarified by the method of indeterminate
coefficients(11), which is a strong tool for deriving all finite models that satisfy a
given set of axioms. In particular, we show how to decide whether a given axiom of
Kleene algebra is independent of the other axioms; through these investigations we
clarify the properties of each axiom and present a set of independent and complete
axioms of Kleene algebra. Finally, all finite models of Kleene algebra with 8
elements are derived as an example.


2  Axioms  of  Kleene  Algebra 

In fuzzy theory(4), three kinds of set operations A∪B, A∩B and A^c are first
defined as follows:

$$\mu_{A \cup B}(a) = \mu_A(a) \vee \mu_B(a)$$

$$\mu_{A \cap B}(a) = \mu_A(a) \bullet \mu_B(a)$$

$$\mu_{A^c}(a) = {\sim}\mu_A(a)$$

where $\mu_A(a)$ and $\mu_B(a)$ are the membership values for an element a to belong to
the fuzzy sets A and B, respectively, and take any value in the unit interval [0,1], and
∨, • and ~ are logic operations defined as
[Definition 1]
a ∨ b = max(a, b)
a • b = min(a, b)
~a = 1 − a

where a, b are elements of [0,1].

The logic operations ∨ and • in Definition 1 were later generalized into the
t-conorm and t-norm, respectively, but the above definitions of the logic operations
OR (∨), AND (•) and NOT (~) remain fundamental and essential in fuzzy logic.

It is easily shown that the set of logic operations {∨, •, ~} defined above
satisfies the equalities listed in Table 1 in the closed interval [0,1].

If an algebraic system satisfies the equations [1]~[4] then it is a lattice, if it
satisfies the equations [1]~[7] then it is a bounded distributive lattice, and if it
satisfies the equations [1]~[9] then it is a De Morgan algebra.

[Definition  2]  The  bounded  distributive  lattice  satisfying  [8],  [9]  and  [10]  is  called 
Kleene  algebra. 

That is, a Kleene algebra is a De Morgan algebra satisfying [10] Kleene's laws, where
Kleene's laws are weaker conditions than

[10]' The Complementary laws: ~a • a = 0, ~a ∨ a = 1.

The above complementary laws correspond to the law of excluded middle and the
law of contradiction, which are essential parts of Boolean algebra or two-valued logic.

Table 1. Axioms of Kleene Algebra

[1]  Commutative laws:      a ∨ b = b ∨ a  [1-1],   a • b = b • a  [1-2]
[2]  Idempotent laws:       a ∨ a = a  [2-1],   a • a = a  [2-2]
[3]  Absorption laws:       a ∨ (a • b) = a  [3-1],   a • (a ∨ b) = a  [3-2]
[4]  Associative laws:      a ∨ (b ∨ c) = (a ∨ b) ∨ c  [4-1],   a • (b • c) = (a • b) • c  [4-2]
[5]  Distributive laws:     a • (b ∨ c) = (a • b) ∨ (a • c)  [5-1],   a ∨ (b • c) = (a ∨ b) • (a ∨ c)  [5-2]
[6]  The least element:     0 ∨ a = a  [6-1],   0 • a = 0  [6-2]
[7]  The greatest element:  1 ∨ a = 1  [7-1],   1 • a = a  [7-2]
[8]  Double negations law:  ~(~a) = a  [8]
[9]  De Morgan's laws:      ~(a ∨ b) = ~a • ~b  [9-1],   ~(a • b) = ~a ∨ ~b  [9-2]
[10] Kleene's laws:         (~a ∨ a) ∨ (~b • b) = ~a ∨ a  [10-1],
                            (~a ∨ a) • (~b • b) = ~b • b  [10-2]

where a, b ∈ [0,1].

[Theorem 1] The algebra of fuzzy logic <[0,1], ∨, •, ~> is a Kleene algebra.
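
As a numerical sanity check (not a proof), the following Python sketch evaluates every axiom of Table 1 for the operations of Definition 1 on the sample points 0, 1/10, ..., 1. The grid, the use of exact fractions, and all identifiers are choices made for this illustration. It also shows that the complementary laws [10]' fail at a = 1/2, which is why fuzzy logic is a Kleene algebra but not a Boolean algebra.

```python
from fractions import Fraction
from itertools import product

def OR(a, b):
    return max(a, b)

def AND(a, b):
    return min(a, b)

def NOT(a):
    return 1 - a

# Sample points of [0,1]; exact rationals avoid floating-point rounding.
GRID = [Fraction(k, 10) for k in range(11)]

AXIOMS = {
    "1-1": lambda a, b, c: OR(a, b) == OR(b, a),
    "1-2": lambda a, b, c: AND(a, b) == AND(b, a),
    "2-1": lambda a, b, c: OR(a, a) == a,
    "2-2": lambda a, b, c: AND(a, a) == a,
    "3-1": lambda a, b, c: OR(a, AND(a, b)) == a,
    "3-2": lambda a, b, c: AND(a, OR(a, b)) == a,
    "4-1": lambda a, b, c: OR(a, OR(b, c)) == OR(OR(a, b), c),
    "4-2": lambda a, b, c: AND(a, AND(b, c)) == AND(AND(a, b), c),
    "5-1": lambda a, b, c: AND(a, OR(b, c)) == OR(AND(a, b), AND(a, c)),
    "5-2": lambda a, b, c: OR(a, AND(b, c)) == AND(OR(a, b), OR(a, c)),
    "6-1": lambda a, b, c: OR(0, a) == a,
    "6-2": lambda a, b, c: AND(0, a) == 0,
    "7-1": lambda a, b, c: OR(1, a) == 1,
    "7-2": lambda a, b, c: AND(1, a) == a,
    "8": lambda a, b, c: NOT(NOT(a)) == a,
    "9-1": lambda a, b, c: NOT(OR(a, b)) == AND(NOT(a), NOT(b)),
    "9-2": lambda a, b, c: NOT(AND(a, b)) == OR(NOT(a), NOT(b)),
    "10-1": lambda a, b, c: OR(OR(NOT(a), a), AND(NOT(b), b)) == OR(NOT(a), a),
    "10-2": lambda a, b, c: AND(OR(NOT(a), a), AND(NOT(b), b)) == AND(NOT(b), b),
}

# Names of the axioms violated somewhere on the grid.
violated = [name for name, law in AXIOMS.items()
            if not all(law(a, b, c) for a, b, c in product(GRID, repeat=3))]

# The complementary laws [10]' evaluated at a = 1/2.
half = Fraction(1, 2)
complementary_at_half = (AND(NOT(half), half) == 0, OR(NOT(half), half) == 1)

print(violated)
print(complementary_at_half)
```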

3  A  Finite  Model  Showing  an  Axiom  Being  Independent  from 
Others 

One of the interesting problems concerning the above axioms of Kleene algebra is
whether each axiom is independent of the others in the set of axioms. This
problem was first considered by M. Mukaidono(7), who gave a set of axioms of
Kleene algebra in which each axiom is independent of the others. Recently, this
problem was investigated again more thoroughly by the authors(12) using the
method of indeterminate coefficients(11). In this paper we explain this topic
in more detail. Before that, we have to explain what the method of indeterminate
coefficients is. The method was first developed by M. Goto(11) to find, by computer,
many-valued truth tables for undefined operators in axioms. The axioms of an algebra
described by equations generally contain operators such as ∨, • and ~. In the
method of indeterminate coefficients, these operators are regarded as undefined and
the axioms are regarded as constraints that these operators must satisfy. As output,
the algorithm based on the method of indeterminate coefficients gives truth tables of
the operators, that is, finite models of the algebra that satisfy all the given axioms.
To find finite models, we first have to specify the number N of elements in the model
(this corresponds to the number of truth values in N-valued logic).

[The  algorithm  based  on  the  method  of  indeterminate  coefficients] 

Input: 

(1)  N:  number  of  elements 

(2)  A  set  of  axioms 

Output: 

Truth tables of undefined operators satisfying all given axioms

The method of indeterminate coefficients is described briefly in the appendix. For
more detail on the algorithm, please see references (11) or (12).

[Example] 

Input: 

(1)  N=3 

(2) [2-1] Idempotent law: a ∨ a = a

Output: 

Table 2:

Table 2. a ∨ b

  b\a | 1 2 0
  ----+------
   1  | 1 * *
   2  | * 2 *
   0  | * * 0

In the above example, we specify the number of truth values as 3, that is,
{0,1,2}, and the set of axioms as the single equation of the idempotent law: a ∨ a = a.
We obtain one solution, shown in Table 2. Note that the table contains the symbol *,
which means “don't care”; that is, any element of {0,1,2} is admissible in place of *.
In this sense, the number of solutions is only one, but the number of three-valued
truth tables (finite models) satisfying the idempotent law is 3×3×3×3×3×3 = 729,
because Table 2 contains six occurrences of *.
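
The count can be reproduced by plain enumeration. The Python sketch below is not the method of indeterminate coefficients itself (which is described in the appendix and in the references); it is a brute-force stand-in, with hypothetical names, that tries all 3^9 candidate tables for the undefined operator ∨ and keeps those satisfying the idempotent law.

```python
from itertools import product

N = 3
VALUES = list(range(N))                      # truth values {0, 1, 2}
CELLS = [(a, b) for a in VALUES for b in VALUES]

def idempotent(table):
    """Axiom [2-1]: a v a = a for every truth value a."""
    return all(table[(a, a)] == a for a in VALUES)

# Enumerate all 3**9 candidate truth tables for the undefined operator v.
models = []
for entries in product(VALUES, repeat=len(CELLS)):
    table = dict(zip(CELLS, entries))
    if idempotent(table):
        models.append(table)

# A cell is a "don't care" (*) when every truth value occurs there in some model.
stars = sum(1 for cell in CELLS
            if {m[cell] for m in models} == set(VALUES))

print(len(models), stars)
```

The enumeration finds 729 truth tables and six don't-care cells, in agreement with the single solution of Table 2.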

By using the method of indeterminate coefficients we can clarify the properties of
each axiom and its power to determine the solutions. In particular, we can examine
whether an axiom is independent of the other axioms. To do so, we first obtain the
truth tables satisfying the other axioms, and then add the given axiom and obtain the
truth tables satisfying all axioms. If the number of truth tables is reduced, this
proves that the given axiom is independent of the other axioms: every truth table
that disappears is a counterexample (a finite model) that satisfies the other axioms
but not the given one.
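
The independence test just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: it uses brute-force enumeration rather than the authors' algorithm, a single undefined binary operator, N = 2, and only the idempotent and commutative laws; all function names are hypothetical. It counts truth tables, not solutions with don't-care entries.

```python
from itertools import product

def models(n, axioms):
    """All truth tables of one binary operator on {0..n-1} satisfying the axioms."""
    vals = range(n)
    cells = [(a, b) for a in vals for b in vals]
    found = []
    for entries in product(vals, repeat=len(cells)):
        op = dict(zip(cells, entries))
        if all(axiom(op, vals) for axiom in axioms):
            found.append(op)
    return found

def idempotent(op, vals):
    return all(op[a, a] == a for a in vals)

def commutative(op, vals):
    return all(op[a, b] == op[b, a] for a in vals for b in vals)

def independence_witnesses(n, others, candidate):
    """Truth tables that satisfy the other axioms but violate the candidate axiom."""
    return [op for op in models(n, others) if not candidate(op, range(n))]

before = models(2, [idempotent])
after = models(2, [idempotent, commutative])
witnesses = independence_witnesses(2, [idempotent], commutative)
print(len(before), len(after), len(witnesses))
```

In this toy case the count drops from 4 to 2, so the commutative law is independent of the idempotent law, and each of the two witnesses is a counterexample. An empty witness list for a given N would not establish dependence, since a counterexample may exist for a larger N.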

3-1  Double  Negations  Law  Is  Independent 

At  first  we  show  that 

[8] Double negations law: ~(~a) = a  [8]

is independent of the others. Table 3 lists, for N=3, the number of solutions
satisfying each axiom together with all axioms before it. The upper line
shows the axiom number and the lower line shows the number of solutions that satisfy
that axiom and all preceding axioms. Although the order in which the axioms are
examined is arbitrary in principle, in practice it is chosen so that the number of
solutions does not become too large along the way. Note again that the numbers in the
lower line count solutions, not truth tables. The final result is a single solution,
the truth table shown in Table 4, which is the only three-valued
model of Kleene algebra. In Table 3 there are 10 solutions up to axiom [10-2],
but at axiom [8] nine of them disappear; all nine are models showing that
[8] Double negations law is independent of the others. Table 5 is one example:
this truth table satisfies all axioms of Kleene algebra except
[8] Double negations law.

Table 3. Solutions of [8] Double negations law in three-valued logic

Axiom #           [6-1]  [7-1]  [9-1]  [3-1]  [2-1]  [4-1]  [5-1]  [10-1]  [1-1]  [1-2]
No. of solutions      1      1    316   4350   3453   2033    287     173    107     25

Axiom #           [4-2]  [3-2]  [5-2]  [2-2]  [9-2]  [6-2]  [7-2]  [10-2]    [8]
No. of solutions     24     10     10     10     10     10     10      10      1

Table 4. A model of Kleene algebra in three-valued logic

      a ∨ b              a • b             ~a
  b\a | 1 2 0        b\a | 1 2 0         a | ~a
  ----+------        ----+------        ---+---
   1  | 1 1 1         1  | 1 2 0         1 | 0
   2  | 1 2 2         2  | 2 2 0         2 | 2
   0  | 1 2 0         0  | 0 0 0         0 | 1

Table 5. A model showing [8] double negations law is independent

      a ∨ b              a • b             ~a
  b\a | 1 2 0        b\a | 1 2 0         a | ~a
  ----+------        ----+------        ---+---
   1  | 1 1 1         1  | 1 2 0         1 | 1
   2  | 1 2 2         2  | 2 2 0         2 | 1
   0  | 1 2 0         0  | 0 0 0         0 | 1
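
The claims about Tables 4 and 5 can be checked mechanically. The Python sketch below transcribes the two tables as read here (row b, column a, element order 1, 2, 0) and tests every axiom of Table 1 by exhaustive evaluation; the helper names `make_table` and `failing_axioms` are hypothetical and the checker is only an illustration, not the authors' program.

```python
from itertools import product

def make_table(elems, rows):
    """rows[i][j] is the entry in row elems[i] (b) and column elems[j] (a)."""
    return {b: dict(zip(elems, row)) for b, row in zip(elems, rows)}

def failing_axioms(elems, OR, AND, NOT, zero=0, one=1):
    """Labels of the Table 1 axioms violated by the given truth tables."""
    def v(a, b): return OR[b][a]
    def m(a, b): return AND[b][a]
    def n(a): return NOT[a]
    laws = {
        "1-1": lambda a, b, c: v(a, b) == v(b, a),
        "1-2": lambda a, b, c: m(a, b) == m(b, a),
        "2-1": lambda a, b, c: v(a, a) == a,
        "2-2": lambda a, b, c: m(a, a) == a,
        "3-1": lambda a, b, c: v(a, m(a, b)) == a,
        "3-2": lambda a, b, c: m(a, v(a, b)) == a,
        "4-1": lambda a, b, c: v(a, v(b, c)) == v(v(a, b), c),
        "4-2": lambda a, b, c: m(a, m(b, c)) == m(m(a, b), c),
        "5-1": lambda a, b, c: m(a, v(b, c)) == v(m(a, b), m(a, c)),
        "5-2": lambda a, b, c: v(a, m(b, c)) == m(v(a, b), v(a, c)),
        "6-1": lambda a, b, c: v(zero, a) == a,
        "6-2": lambda a, b, c: m(zero, a) == zero,
        "7-1": lambda a, b, c: v(one, a) == one,
        "7-2": lambda a, b, c: m(one, a) == a,
        "8": lambda a, b, c: n(n(a)) == a,
        "9-1": lambda a, b, c: n(v(a, b)) == m(n(a), n(b)),
        "9-2": lambda a, b, c: n(m(a, b)) == v(n(a), n(b)),
        "10-1": lambda a, b, c: v(v(n(a), a), m(n(b), b)) == v(n(a), a),
        "10-2": lambda a, b, c: m(v(n(a), a), m(n(b), b)) == m(n(b), b),
    }
    triples = list(product(elems, repeat=3))
    return [name for name, law in laws.items()
            if not all(law(*t) for t in triples)]

E3 = [1, 2, 0]
OR3 = make_table(E3, [[1, 1, 1], [1, 2, 2], [1, 2, 0]])
AND3 = make_table(E3, [[1, 2, 0], [2, 2, 0], [0, 0, 0]])
NOT_TABLE4 = {1: 0, 2: 2, 0: 1}
NOT_TABLE5 = {1: 1, 2: 1, 0: 1}

table4_failures = failing_axioms(E3, OR3, AND3, NOT_TABLE4)
table5_failures = failing_axioms(E3, OR3, AND3, NOT_TABLE5)
print(table4_failures, table5_failures)
```

Under this reading, Table 4 violates no axiom, and Table 5 violates only [8].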

3-2  De  Morgan’s  Laws  Are  Independent 

To  show  one  of 

[9] De Morgan's laws:  ~(a ∨ b) = ~a • ~b  [9-1]

                       ~(a • b) = ~a ∨ ~b  [9-2]

is independent of the others, we choose [9-1] as the last axiom and apply the
algorithm of the method of indeterminate coefficients. If we let N=3, that is,
three-valued, we get the result shown in Table 6. In this case the truth table is
determined uniquely at axiom [1-2], which means that we cannot decide within
three-valued logic whether [9-1] De Morgan's law is independent of the other axioms. In
general, even if the number of solutions does not decrease at the final stage in
N-valued logic, this does not prove that the last axiom depends on the preceding
axioms, because a counterexample may still be found with N+1 values. Indeed,
letting N=4 in this case, we can show that [9-1] De Morgan's law ~(a ∨ b) = ~a • ~b is
independent, as shown in Table 7. An example of the truth tables that disappear at the
last step of Table 7 is shown in Table 8; it is a model showing that [9-1] De Morgan's
law is independent. The situation is the same if [9-2] De Morgan's law
~(a • b) = ~a ∨ ~b is placed last instead of [9-1] De Morgan's law ~(a ∨ b) = ~a • ~b.

Table 6. Solution of [9-1] De Morgan's law in three valued

Axiom #       [8]  [6-1]  [7-1]  [1-1]  [10-1]  [3-2]  [5-1]  [1-2]  [4-1]  [3-1]
N. of solut.    4      4      4     12      52      5     10      1      1      1

Axiom #       [2-1]  [4-2]  [5-2]  [2-2]  [10-2]  [6-2]  [7-2]  [9-1]
N. of solut.      1      1      1      1       1      1      1      1

Table 7. Solution of [9-1] De Morgan's law in four valued

Axiom #       [8]  [6-1]  [7-1]  [1-1]  [10-1]  [3-2]  [5-1]  [1-2]  [4-1]  [3-1]
N. of solut.    0     10     10    640    9490    985    984      5      5      5

Axiom #       [2-1]  [4-2]  [5-2]  [2-2]  [10-2]  [6-2]  [7-2]  [9-1]
N. of solut.      5      5      5      5       5      5      5      3

Table 8. A model showing [9-1] De Morgan's law is independent

      a v b                    a • b                 ~a
 b \ a | 1  2  3  0       b \ a | 1  2  3  0       a | ~a
   1   | 1  1  1  1         1   | 1  2  3  0       1 |  2
   2   | 1  2  3  2         2   | 2  2  2  0       2 |  1
   3   | 1  3  3  3         3   | 3  2  3  0       3 |  0
   0   | 1  2  3  0         0   | 0  0  0  0       0 |  3
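
The independence argument can be checked mechanically. The following is a minimal sketch, not the authors' program: it assumes that Table 8 is read as the chain 0 < 2 < 3 < 1 with a v b as the larger and a • b as the smaller element, and with the negation column 1->2, 2->1, 3->0, 0->3. The names `RANK`, `NEG`, `LAWS` and `check` are illustrative only.

```python
from itertools import product

VALUES = (1, 2, 3, 0)
RANK = {0: 0, 2: 1, 3: 2, 1: 3}   # assumed reading of Table 8: chain 0 < 2 < 3 < 1
NEG = {1: 2, 2: 1, 3: 0, 0: 3}    # negation column as read from Table 8


def join(a, b):
    """a v b: the larger element of the chain."""
    return a if RANK[a] >= RANK[b] else b


def meet(a, b):
    """a . b: the smaller element of the chain."""
    return a if RANK[a] <= RANK[b] else b


# Each law is stored as (number of variables, predicate over truth values).
LAWS = {
    "[1-1]": (2, lambda a, b: join(a, b) == join(b, a)),
    "[1-2]": (2, lambda a, b: meet(a, b) == meet(b, a)),
    "[5-1]": (3, lambda a, b, c: meet(a, join(b, c)) == join(meet(a, b), meet(a, c))),
    "[5-2]": (3, lambda a, b, c: join(a, meet(b, c)) == meet(join(a, b), join(a, c))),
    "[8]": (1, lambda a: NEG[NEG[a]] == a),
    "[10-1]": (2, lambda a, b: join(join(NEG[a], a), meet(NEG[b], b)) == join(NEG[a], a)),
    "[10-2]": (2, lambda a, b: meet(join(NEG[a], a), meet(NEG[b], b)) == meet(NEG[b], b)),
    "[9-1]": (2, lambda a, b: NEG[join(a, b)] == meet(NEG[a], NEG[b])),
}


def check(laws=LAWS):
    """Return {law name: True if the law holds for every assignment}."""
    return {
        name: all(law(*args) for args in product(VALUES, repeat=arity))
        for name, (arity, law) in laws.items()
    }


report = check()
for name, ok in report.items():
    print(name, "holds" if ok else "fails")
```

Under this reading every listed law except [9-1] holds; for instance a=1, b=3 gives ~(a v b) = 2 but ~a • ~b = 0.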

3-3  Commutative, Distributive and Kleene's Laws Are Independent

In a similar manner we can show that each one of

    Commutative laws:   a v b = b v a                        [1-1]
                        a • b = b • a                        [1-2]
    Distributive laws:  a • (b v c) = (a • b) v (a • c)      [5-1]
                        a v (b • c) = (a v b) • (a v c)      [5-2], and
    Kleene's laws:      (~a v a) v (~b • b) = ~a v a         [10-1]
                        (~a v a) • (~b • b) = ~b • b         [10-2]

is independent of the others. In fact, Table 9 is a model showing that the
Commutative law [1-1] is independent, Table 10 is a model showing that the
Distributive law [5-1] is independent, and Table 11 is a model showing that
Kleene's law [10-1] is independent. The situation is the same for the Commutative
law [1-2], a • b = b • a, the Distributive law [5-2], a v (b • c) = (a v b) • (a v c),
and Kleene's law [10-2], (~a v a) • (~b • b) = ~b • b, respectively.


Table 9. A model showing [1-1] Commutative law is independent

      a v b                    a • b                 ~a
 b \ a | 1  2  3  0       b \ a | 1  2  3  0       a | ~a
   1   | 1  1  1  1         1   | 1  2  3  0       1 |  0
   2   | 1  2  2  2         2   | 1  2  3  0       2 |  3
   3   | 1  2  3  0         3   | 3  3  3  3       3 |  2
   0   | 1  2  3  0         0   | 0  0  0  0       0 |  1
Table 10. A model showing [5-1] Distributive law is independent

      a v b                       a • b                    ~a
 b \ a | 1  2  3  4  0       b \ a | 1  2  3  4  0       a | ~a
   1   | 1  1  1  1  1         1   | 1  2  3  4  0       1 |  0
   2   | 1  2  1  1  2         2   | 2  2  0  0  0       2 |  4
   3   | 1  1  3  1  3         3   | 3  0  3  0  0       3 |  3
   4   | 1  1  1  4  4         4   | 4  0  0  4  0       4 |  2
   0   | 1  2  3  4  0         0   | 0  0  0  0  0       0 |  1

Table 11. A model showing [10-1] Kleene's law is independent

      a v b                    a • b                 ~a
 b \ a | 1  2  3  0       b \ a | 1  2  3  0       a | ~a
   1   | 1  1  1  1         1   | 1  2  3  0       1 |  0
   2   | 1  2  1  2         2   | 2  2  0  0       2 |  3
   3   | 1  1  3  3         3   | 3  0  3  0       3 |  2
   0   | 1  2  3  0         0   | 0  0  0  0       0 |  1

4  Independent and Complete Axioms of Kleene Algebra

From the above considerations, we can see that the set of axioms of Kleene algebra
has to include at least the following axioms:

(1) [8] Double negations law:        ~(~a) = a                            [8]

(2) One of [9] De Morgan's laws:     ~(a v b) = ~a • ~b                   [9-1]
                                     ~(a • b) = ~a v ~b                   [9-2]

(3) One of [1] Commutative laws:     a v b = b v a                        [1-1]
                                     a • b = b • a                        [1-2]

(4) One of [5] Distributive laws:    a • (b v c) = (a • b) v (a • c)      [5-1]
                                     a v (b • c) = (a v b) • (a v c)      [5-2]

(5) One of [10] Kleene's laws:       (~a v a) v (~b • b) = ~a v a         [10-1]
                                     (~a v a) • (~b • b) = ~b • b         [10-2]

Similarly, it is easily shown that

(6) One of [6] The least element:    0 v a = a  [6-1]    0 • a = 0  [6-2]
           [7] The greatest element: 1 v a = 1  [7-1]    1 • a = a  [7-2]

is independent of the others, by defining 1 = ~0.

It was shown that each of the above axioms is independent of the others in the set
of axioms of Kleene algebra. The next question is what a complete set of axioms of
Kleene algebra is. To obtain such a complete set, we have to show that every axiom
of Kleene algebra ([1-1] to [10-2]) can be derived from it. To show that an axiom is
derived from the other axioms (that is, that it is not independent of them), we cannot
use the method of indeterminate coefficients: even if the number of solutions does
not decrease when the axiom is placed last, this does not prove that the axiom
depends on the others, as described in Section 3-2. The method of indeterminate
coefficients only finds candidates for the complete set of axioms. So we have to
show formally that every axiom of Kleene algebra can be derived from the complete
set of axioms, where the above six axioms are candidates for its elements. Indeed,
M. Mukaidono (7) has shown that the six axioms listed in Table 12, all selected from
the above six candidates, form a complete set of axioms of Kleene algebra; they are,
of course, also independent of each other.

Table 12. A set of independent and complete axioms of Kleene algebra (7)

[1]  Commutative laws:      a v b = b v a                        [1-1]
[5]  Distributive laws:     a • (b v c) = (a • b) v (a • c)      [5-1]
[6]  The least element:     0 v a = a                            [6-1]
[8]  Double negations law:  ~(~a) = a                            [8]
[9]  De Morgan's laws:      ~(a v b) = ~a • ~b                   [9-1]
[10] Kleene's laws:         (~a v a) v (~b • b) = ~a v a         [10-1]

By using the above facts, other sets of independent and complete axioms of
Kleene algebra were reported recently (12). The technique has also been applied to
the axioms of Boolean algebra, and 64 different sets of independent and complete
axioms of Boolean algebra have been discovered (13).


5  Finite  Models  of  Kleene  Algebra 

In the same way as the three-valued model of Kleene algebra in Table 4 was
obtained, we can derive all finite models satisfying a given set of axioms by using
the method of indeterminate coefficients for any number of values N, as far as
computation time permits. As an example, Figure 1 illustrates the Hasse diagrams of
all finite models of Kleene algebra for N=8. To obtain these models with the method
of indeterminate coefficients, we used the set of axioms of Kleene algebra listed in
Table 12, which are independent and complete, because it is easier to derive all
models when the number of axioms is smaller.


Fig.  1.  All  models  of  8-valued  Kleene  Algebra 


6  Conclusions 


The properties and roles of each axiom of Kleene algebra are clarified, and the
independence of each axiom is examined through the method of indeterminate
coefficients. As by-products, we obtain a set of independent and complete axioms of
Kleene algebra and all finite models of Kleene algebra in some finite cases.

Appendix

A  Brief  Description  of  the  Method  of  Indeterminate  Coefficients 

Let $f(x,y)$ be an undefined operator appearing in a set of axioms
$\{S_1, \ldots, S_k, \ldots, S_n\}$. Then we can write

$$f(x, y) = \bigvee_{i,j=0}^{N-1} X_{i,j} \cdot x^{i} \cdot y^{j},$$

where $x^{i} = 1$ if $x = i$ and $x^{i} = 0$ if $x \neq i$, and $X_{i,j}$ is an
indeterminate coefficient that takes a value in $\{0, 1, \ldots, N-1\}$ if the undefined
operator is defined uniquely. In the sequel, however, $X_{i,j}(k,m)$
($m = 1, \ldots, m_k$; $i, j = 0, 1, \ldots, N-1$) takes a subset of
$\{0, 1, \ldots, N-1\}$, where $X_{i,j}(k,m)$ denotes the $m$-th partial solution in the
$k$-th step.

(1) Let $X_{i,j}(0,1) = * = \{0, \ldots, N-1\}$ ($i, j = 0, 1, \ldots, N-1$) be the starting
partial solution. Let $m_0 = 1$ and $k = 1$.

(2) Derive all partial solutions $X_{i,j}(k,m)$ ($m = 1, \ldots, m_k$) satisfying the axiom
$S_k$ from the initial conditions $X_{i,j}(k-1,m)$ ($m = 1, \ldots, m_{k-1}$).

(3) Repeat the above step ($k = 1, \ldots, n$) until $S_n$.

(4) The final partial solutions are the general solutions satisfying all given axioms
$\{S_1, \ldots, S_k, \ldots, S_n\}$, and if $m_n = 1$ and $X_{i,j}(n,1)$ takes a single
element of $\{0, \ldots, N-1\}$ for all $i$ and $j$, then the truth table of $f(x,y)$ is
determined uniquely.
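
The step-by-step narrowing of candidate truth tables can be illustrated with a much simpler stand-in. The sketch below is an assumption-laden simplification, not the method itself: instead of propagating partial solutions $X_{i,j}(k,m)$, it enumerates complete candidate tables by brute force and filters them axiom by axiom, and it treats only the negation as undefined while fixing a v b and a • b as max and min on the assumed chain 0 < 2 < 1. The names `AXIOMS` and `filter_models` are hypothetical.

```python
from itertools import product

VALUES = (1, 2, 0)              # three truth values, in the order used by the tables
RANK = {0: 0, 2: 1, 1: 2}       # assumed order 0 < 2 < 1


def join(a, b):
    return a if RANK[a] >= RANK[b] else b


def meet(a, b):
    return a if RANK[a] <= RANK[b] else b


def pairs():
    return product(VALUES, repeat=2)


# Axioms applied one after another; `n` is a candidate negation table (a dict).
AXIOMS = [
    ("[9-1]", lambda n: all(n[join(a, b)] == meet(n[a], n[b]) for a, b in pairs())),
    ("[10-1]", lambda n: all(
        join(join(n[a], a), meet(n[b], b)) == join(n[a], a) for a, b in pairs())),
    ("[8]", lambda n: all(n[n[a]] == a for a in VALUES)),
]


def filter_models(axioms=AXIOMS):
    """Filter all candidate negation tables by each axiom in turn.

    Returns the surviving tables and the number of solutions after each step.
    """
    candidates = [dict(zip(VALUES, image))
                  for image in product(VALUES, repeat=len(VALUES))]
    history = [("start", len(candidates))]
    for name, holds in axioms:
        candidates = [n for n in candidates if holds(n)]
        history.append((name, len(candidates)))
    return candidates, history


models, history = filter_models()
print(history)
print(models)
```

The `history` list plays the role of the "N. of solut." rows in the tables above: a drop at the last axiom indicates that it excludes at least one model of the earlier axioms.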
Abstract. The concept of Generalized Quantifier (GQ) was introduced
to query languages in [9] and, independently, in [10]. The present paper
shows how GQs can be used in Conceptual Modeling, specifically
how they can be incorporated into Entity-Relationship diagrams ([4])
to increase their expressive power. A language to express E-R models is
defined and given formal semantics. It is then shown how GQs can be
easily added to this framework. Several GQs that have natural, intuitive
interpretations in the context of conceptual modeling are defined; their
use is shown through examples.


1  Introduction 

Conceptual modeling is one of the most important steps in the creation of an
Information System. Modeling is a notoriously difficult activity; it cannot be
treated algorithmically, and it requires ingenuity and experience. One of the
main tools for the task is the use of conceptual models, semiformal specifications
of how to structure and express information. The Entity-Relationship (E-R) model
is one of the most successful models ([4]); it is simple, intuitive, yet relatively
powerful. However, its limitations are well-known. Many developments in
conceptual modeling assume that logic methods are too inflexible, too limited and
too unintuitive to be useful for modeling. This paper is a starting point for work
that counters this assumption and gives logic a place in conceptual models. In
particular, it shows how to overcome many of the limitations of the E-R model by
using higher-order operators (intuitively, relations on relations). Reasoning with
higher-order concepts may be complex from both a computational complexity
point of view and a conceptual point of view. We propose to use the framework
of Generalized Quantifiers (GQs) to attack the problem. GQs are declarative,
powerful, higher-order operators that have a natural graphical representation.

In the next section, we review the basics of Entity-Relationship models, give
a  formal  description  for  them,  and  introduce  the  concept  of  GQ.  In  section  3  we 
show  some  of  the  problems  that  this  approach  is  trying  to  solve  by  giving  some 
examples  of  situations  in  which  information  is  hard  or  impossible  to  capture  in 
a  traditional  E-R  model.  In  section  4  we  formalize  E-R  models  and  extend  the 
formalization  with  a  selected  set  of  GQs;  we  give  examples  of  how  the  extension 
solves  the  problems  of  section  3.  Finally,  we  mention  some  related  work  and  close 
with  some  conclusions  and  comments  on  further  work. 


Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 321-330, 2000.
© Springer-Verlag Berlin Heidelberg 2000

2  Background

In  this  section  we  give  some  preliminary  definitions  for  the  rest  of  the  paper.  For 
completeness,  the  next  subsection  briefly  introduces  the  basic  ideas  of  Entity- 
Relationship  models,  while  subsection  2.2  introduces  the  concept  of  Generalized 
Quantifier,  together  with  some  examples. 

2.1  Entity-Relationship  Models 

An  E-R  model  is  a  data  model  with  three  basic  concepts:  entities,  attributes  and 
relationships.  Entities  represent  things  either  real  or  conceptual.  They  denote 
sets  of  objects,  not  particular  objects;  in  this  respect  they  are  close  to  classes 
in object-oriented models. The set of objects modeled by an entity is called its
extension.

Relationships are connections among entities. The arity of a relationship is
the number of entities involved: unary, binary, ternary, and so on. Binary
relationships are the most common. Two kinds of constraints are associated with
relationships. The participation constraint tells us whether all objects in the
extension of an entity are involved in the relationship, or whether some may not be.
For example, entities Office and Employee may have a relationship works-on
between them. If all offices have employees, then participation of Office in
works-on is total (otherwise it is partial). If all employees are based in an office,
then participation of Employee is also total. The cardinality constraint tells us
how many times an object in the entity's extension may be involved in a
relationship, and allows us to classify the relationship as one-to-one, one-to-many
or many-to-many. We note that recursive relationships are allowed: they relate
one entity to itself. For example, the relationship Child-of relates the entity
Person to itself. To distinguish the ways in which Person participates in this
relationship, roles are added to the entity (father and son, in this example).

Entities  and  relationships  have  attributes,  which  are  properties  with  a  value. 
Attributes  convey  characteristics  or  descriptive  information  about  the  entity  to 
which  they  belong.  Attributes  may  be  simple  or  composite,  single  or  multivalued, 
primitive  or  derived. 

2.2  Generalized  Quantifiers 

Generalized Quantifiers were first introduced in logical studies ([13], [12]). The
concept has attracted attention lately for its uses, among others, in query
languages ([9], [10]) and other languages like description logics ([3]).

Given  a  set  M,  a  Generalized  Quantifier  (GQ)  on  M  is  a  relation  among 
subsets  of  relations  on  M. 

Definition 1. Let a type be a finite sequence of positive numbers, which will
be written $[k_1, \ldots, k_n]$. Then a generalized quantifier of type $[k_1, \ldots, k_n]$ on $M$
is an $n$-ary relation between subsets of $M^{k_1}, \ldots, M^{k_n}$ (i.e. between elements of
$\mathcal{P}(M^{k_1}) \times \cdots \times \mathcal{P}(M^{k_n})$).
Not every relation between subsets of the domain is considered a GQ.
Intuitively, we would like a GQ to behave as a logical operator, in the sense that
it should not distinguish between elements in the domain. Thus, many authors
pose the following constraint on the definition:

Definition 2. (PERM) A quantifier $Q$ follows PERM if, whenever $f$ is a
permutation on $M$, then $Q_M(A_1, \ldots, A_n)$ iff $Q_M(f[A_1], \ldots, f[A_n])$.

In the context of database query languages, this constraint ensures that
quantifiers are generic operations ([1]).

The following are examples of GQs. A universe $M$ is fixed. We use $Q$ as
a variable over GQs, and write $Q(A_1, \ldots, A_n)$ to indicate that sets $A_1, \ldots, A_n$
belong to the extension of $Q$, i.e. that they are in the relation denoted by $Q$.

$\mathbf{all} = \{X, Y \subseteq M \mid X \subseteq Y\}$
$\mathbf{some} = \{X, Y \subseteq M \mid X \cap Y \neq \emptyset\}$
$\mathbf{no} = \{X, Y \subseteq M \mid X \cap Y = \emptyset\}$
$\mathbf{at\ least}\ n = \{X, Y \subseteq M \mid |X \cap Y| \geq n\}$
$\mathbf{at\ most}\ n = \{X, Y \subseteq M \mid |X \cap Y| \leq n\}$
$I = \{X, Y \subseteq M \mid |X| = |Y|\}$ (Hartig's quantifier)
$Q_R = \{X \subseteq M \mid |X| > |M - X|\}$ (Rescher's quantifier)
$H = \{R \subseteq M^4 \mid \exists f : M \to M\ \exists g : M \to M\ \forall a, b \in M\ \langle a, f(a), b, g(b) \rangle \in R\}$
$W_R = \{X \subseteq M, R \subseteq M^2 \mid R \text{ well-orders } X\}$

All the above quantifiers are of type [1,1], except $Q_R$ (type [1]), $H$ (type [4])
and $W_R$ (type [1,2]). Note that the first five quantifiers are first-order definable;
the last four are not.
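
Over a finite universe, the simpler quantifiers in this list can be evaluated directly. The following is a minimal sketch, assuming sets are represented as Python `set` objects; the function names (`q_all`, `q_at_least`, and so on) are illustrative and do not come from the paper.

```python
def q_all(x, y):
    """all: X is included in Y."""
    return x <= y


def q_some(x, y):
    """some: X and Y share at least one element."""
    return bool(x & y)


def q_no(x, y):
    """no: X and Y are disjoint."""
    return not (x & y)


def q_at_least(n):
    """at least n: the intersection has n or more elements."""
    return lambda x, y: len(x & y) >= n


def q_at_most(n):
    """at most n: the intersection has n or fewer elements."""
    return lambda x, y: len(x & y) <= n


def q_hartig(x, y):
    """Hartig's quantifier: X and Y have the same cardinality."""
    return len(x) == len(y)


def q_rescher(universe):
    """Rescher's quantifier: X is larger than its complement in the universe."""
    return lambda x: len(x) > len(universe - x)


M = set(range(6))
X, Y = {0, 1, 2}, {0, 1, 2, 3}
print(q_all(X, Y), q_some(X, {5}), q_at_least(3)(X, Y), q_rescher(M)(Y))
```

Each function tests membership of its arguments in the corresponding relation, which mirrors the notation $Q(A_1, \ldots, A_n)$ used above.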

3  Limitations  of  the  E-R  Model 

E-R  models  have  a  layered  approach  to  organizing  information,  in  the  sense 
that  only  entities  and  relationships  can  have  attributes  and  only  entities  can  be 
involved  in  relationships.  Thus,  it  is  not  possible  for  attributes  to  have  attributes, 
or  to  be  involved  in  relationships;  and  it  is  not  possible  for  relationships  to 
be  involved  in  other  relationships.  This  results  in  limitations  on  what  can  be 
expressed  in  the  model.  We  call  limitations  due  to  the  first  rule  constraints:  they 
usually  express  some  condition  on  the  values  that  some  attribute(s)  can  take. 
We  will  not  deal  with  them  in  this  paper.  Here  we  concentrate  on  overcoming 
the  second  class  of  limitations  by  proposing  higher-order  operators  (intuitively, 
relations  on  relations).  To  give  an  idea  of  the  problems  that  E-R  models  face, 
we  give  several  examples  of  situations  in  which  the  model  is  not  able  to  capture 
necessary  information. 

One such situation is the connection trap: let $E_1$, $E_2$, $E_3$ be entities and
$R_1 \subseteq E_1 \times E_2$, $R_2 \subseteq E_2 \times E_3$ be relationships. The connection trap problem is
that of inferring properties of a possible connection between $E_1$ and $E_3$ based
on $R_1$ and $R_2$. Sometimes the relationship may not exist at all; sometimes the
relationship may exist but is not determined by composing $R_1$ and $R_2$. The
following examples are taken from [5].
Example 1. Let entities Division, Staff and Branch be related through
relationships IsAllocated, between Division and Staff, and Operates, between
Division and Branch. Participation in IsAllocated is total, 1 on the Division
side and many on the Staff side (i.e. all staff members are allocated to one and
only one division, and each division is allocated one or more staff members).
Participation in Operates is total, 1 on the Division side and many on the Branch
side (i.e. all divisions are assigned one or more branches, and all branches are
assigned to one and only one division). One could assume that staff members
work at particular branches, and that such a relationship between elements of
Staff and Branch can be inferred from the two explicit relationships. However,
we note that both relationships are 1-M, with the 1 on their common domain
(Division). Therefore some element in Staff may be related to more than one
branch, even if one would expect that each staff member works in only one
branch. This is called the fan trap in [5]. A different problem is called
the chasm trap. The chasm trap would come up if either of the two relationships
were partial instead of total. Assume the situation this time involves entities
Branch, Staff and PropertyForRent, related as follows: Branch and Staff are
related by IsAllocated, as before, and Staff and PropertyForRent are related
through relationship Oversees. Participation in Oversees is partial (i.e. not every
staff member oversees a property for rent, and not every property for rent
has someone from Staff assigned to oversee it). It would seem that one could
connect each branch with the property or properties that someone at that branch
oversees. However, there is no guarantee that all branches can be related to at
least some property for rent (since participation in Oversees is partial).
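
Both traps show up as soon as the two relationships are composed. The sketch below is only an illustration with made-up instance data (the division, staff, branch and property identifiers are hypothetical); relationships are represented as sets of pairs and `compose` is ordinary relational composition.

```python
def inverse(r):
    """Swap the two components of every pair in a binary relation."""
    return {(b, a) for (a, b) in r}


def compose(r1, r2):
    """Pairs (a, c) such that (a, b) is in r1 and (b, c) is in r2 for some b."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


# Fan trap: both relationships are 1-M with the 1 on Division.
is_allocated = {("D1", "alice"), ("D1", "bob")}      # Division x Staff
operates = {("D1", "B1"), ("D1", "B2")}              # Division x Branch
staff_branch = compose(inverse(is_allocated), operates)
alice_branches = {b for (s, b) in staff_branch if s == "alice"}
print(sorted(alice_branches))   # every staff member is linked to every branch

# Chasm trap: participation in Oversees is partial.
branch_staff = {("B1", "carol"), ("B2", "dave")}     # Branch x Staff
oversees = {("carol", "P1")}                         # Staff x PropertyForRent
branch_property = compose(branch_staff, oversees)
unconnected = {b for (b, _) in branch_staff} - {b for (b, _) in branch_property}
print(sorted(unconnected))      # branches with no derivable property
```

In the first case the composition relates each staff member to more than one branch; in the second, a branch drops out of the composition entirely.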


Example 2. Assume an E-R model for a university. The model contains entities
Teacher, Class and Department. There are relationships Teaches, between
Teacher and Class, OfferedBy, between classes and departments, and Faculty,
between teachers and departments, with the obvious interpretation. A university
rule is that teachers can only teach classes offered by the department in which
they are faculty. This rule cannot be enforced in the E-R model.


Example  3.  Assume  an  E-R  model  for  a  company,  which  contains  entities  Client 
and  Representative.  The  representatives  are  employees  whose  mission  is  to 
interact  with  and  attend  to  clients;  a  business  rule  is  that  every  client  must  have 
a  representative,  and  that  every  representative  must  attend  to  several  clients. 
Note  that  this  relationship,  which  is  one-to-many  and  total,  induces  a  partition 
on  the  set  of  clients.  Many  properties  of  such  partitions  cannot  be  expressed  in 
the  E-R  model;  for  instance,  a  rule  stipulating  that  all  representatives  must  have 
the  same  number  of  clients  (i.e.  all  sets  in  the  partition  have  the  same  size). 


Example 4. An example of a recursive relationship is the relationship ManagerOf
on the entity Employee. Two roles are associated with Employee through this
relationship: manager and managee. Such a relationship is partial on the manager
role (not all employees are managers) and total on the managee role (all
employees have a manager). This relationship has several properties which cannot
be expressed in the model: it is an irreflexive relationship (no one can be his or
her own manager). In most situations, it will also be asymmetric (if employee a
is the manager of employee b, it cannot be that employee b is, in turn, a
manager of a) and transitive (managers have higher-level managers, and so on). Such
information can be used by a system to check insertions in the relationship for
correctness, but cannot be represented in an E-R model.

4  Extending  the  E-R  Model 

We  formalize  E-R  models  in  a  general  framework  that  will  allow  the  integration 
of  GQs.  First  we  define  signatures  and  give  a  formal  semantics  to  E-R  models. 
We  then  show  how  to  extend  the  model  with  GQs  and  define  several  GQs  which 
are  useful  in  extending  the  semantic  expressivity  of  E-R  models  in  the  context 
of  conceptual  modeling.  Finally,  we  give  several  examples  of  how  our  extension 
deals  with  the  problems  introduced  in  the  previous  section. 

4.1  Formalization  of  E-R  Models 

Usually an E-R model is displayed graphically by an E-R diagram. An E-R
diagram is a graph where entities are represented by nodes and relationships
(including IS-A relationships) are represented by edges; attributes are depicted next
to the entity or relationship they belong to. There are variations between authors in
the way the information is represented graphically. In order to be able to define
our extensions of E-R models without depending on a particular graphical
representation, and to develop a rigorous framework, we define a formal language
in which to express the model. We use a lisp-like syntax, with expressions always
in balanced parentheses. In the following, + means one or more, and [A] means
A is optional. Thus, (<attr-name>)+ means a list of one or more of the objects
of type <attr-name>.

Definition 3. A signature $S$ is a triple $\langle \mathcal{E}, \mathcal{R}, \mathcal{A} \rangle$, where $\mathcal{E}$ (denoted by
ent(S)) is a set of entity names; $\mathcal{R}$ (denoted by rel(S)) is a set of relationship
names; and $\mathcal{A}$ (denoted by att(S)) is a set of attribute names. Each set is
disjoint from the other two.

Definition 4. An E-R model D for a signature S is a set of sentences, where
each sentence is either

- of the form (E <entity-name> (<attr-name>+)), with each <entity-name>
$\in$ ent(S) and each <attr-name> $\in$ attr(S); or

- of the form (R <relationship-name> (<entity-name> : <role> <part-constraint>
<card-constraint>)+ (<attr-name>)+), where <part-constraint> is one of total
or partial, <card-constraint> is one of 1 or M, each <relationship-name>
$\in$ rel(S), each <entity-name> $\in$ ent(S), and each <attr-name> $\in$ attr(S); or

- of the form (A <attr-name> <attr-type> [(<attr-name>)+]), where
each <attr-name> $\in$ attr(S), <attr-type> is one of simple or complex
plus one of single or multivalued, and the optional list of attribute names
is used if the attribute is complex^1.

Definition 5. Given signature S and diagram D, for each $r \in rel(S)$, $comp(r) \subseteq ent(S)$
is the set of entities involved in the relationship denoted by r, that is
$\{e_1, \ldots, e_n \mid (R\ r\ (e_1 : r_1\ p_1\ c_1) \ldots (e_n : r_n\ p_n\ c_n)\ (attr)) \in D\}$, where each
$r_i$ is a role, each $p_i$ a participation constraint and each $c_i$ a cardinality
constraint, for $1 \leq i \leq n$, and $attr$ is a set of attributes.

Definition 6. Given signature S and diagram D, for each $a \in attr(S)$,
$edom(a) = \{e \in ent(S) \mid (E\ e\ attr) \in D \land a \in attr\}$, where attr is a list of attribute
names, and $rdom(a) = \{r \in rel(S) \mid (R\ r\ el\ attr) \in D \land a \in attr\}$, where el is
a list of entity components and attr is a list of attribute names. Thus edom(a)
is the list of entities e such that a is an attribute of e, and rdom(a) is the list of
relationships r such that a is an attribute of r^2.

We  give  formal  semantics  to  the  language  by  defining  a  conceptual  structure 
as  follows. 

Definition 7. Given an E-R model D over signature S, a conceptual structure
is a tuple $\langle M, V, I \rangle$, where M and V are disjoint, nonempty sets, and I is
an interpretation function from the elements of S to $M \cup V$ with the following
characteristics:

- For each $e \in ent(S)$, $I(e) \subseteq M$.

- For each $r \in rel(S)$ such that $comp(r) = \{e_1, \ldots, e_n\}$,
$I(r) \subseteq I(e_1) \times \cdots \times I(e_n)$, and

  - if the participation constraint is total for entity $e_i \in \{e_1, \ldots, e_n\} = comp(r)$,
    then $r[e_i] = I(e_i)$^3, and

  - if the cardinality constraint is 1 for entity $e_i \in \{e_1, \ldots, e_n\} = comp(r)$, then
    $r\langle e_i \rangle$ is a function^4.

- For each $a \in att(S)$, $I(a)$ is either

  - a function $f : M \to V$, such that for each $e \in edom(a)$, $I(e) \subseteq dom(f)$,
    or

  - a function $f : M^n \to V$, such that for each $r \in rdom(a)$, $I(r) \subseteq dom(f)$.
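
Part of Definition 7 can be checked on a finite interpretation. The sketch below is a partial illustration only: it covers the containment $I(r) \subseteq I(e_1) \times \cdots \times I(e_n)$ and the total-participation condition, and deliberately leaves out the cardinality condition and attributes. The dictionary layout of `interp`, the function name `check_relationship` and the sample data are assumptions made for this example.

```python
def check_relationship(interp, rel, components):
    """Check one relationship of a finite interpretation.

    interp     -- dict mapping entity names to sets and relationship names
                  to sets of tuples (a hypothetical encoding of I)
    rel        -- name of the relationship to check
    components -- list of (entity_name, participation) pairs, in the order
                  of the tuple positions; participation is "total" or "partial"
    Returns a list of error messages (empty if both conditions hold).
    """
    errors = []
    tuples = interp[rel]
    for t in tuples:
        # Every component of every tuple must lie in the entity's extension.
        if any(t[i] not in interp[e] for i, (e, _) in enumerate(components)):
            errors.append(f"{rel}: tuple {t} is outside the entity extensions")
    for i, (entity, participation) in enumerate(components):
        involved = {t[i] for t in tuples}
        if participation == "total" and involved != interp[entity]:
            errors.append(f"{rel}: participation of {entity} is not total")
    return errors


interp = {
    "Office": {"o1", "o2"},
    "Employee": {"e1", "e2", "e3"},
    "works-on": {("o1", "e1"), ("o1", "e2"), ("o2", "e3")},
}
spec = [("Office", "total"), ("Employee", "total")]
ok_errors = check_relationship(interp, "works-on", spec)

interp_bad = dict(interp, **{"works-on": {("o1", "e1"), ("o1", "e2")}})
bad_errors = check_relationship(interp_bad, "works-on", spec)
print(ok_errors, bad_errors)
```

Removing the tuple for office o2 leaves both o2 and e3 uninvolved, so both participation checks fail in the second call.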

1  For  simplicity  we  will  assume  from  now  on  that  all  attributes  are  simple  and  single. 
This  simplifies  notation  while  not  subtracting  anything  substantial  from  the  model. 

2  In  the  following,  we  will  assume  for  simplicity  that  every  attribute  applies  only  to 
entities  or  to  relationships.  Again,  this  simplifies  notation  without  affecting  expres¬ 
sive  power. 

3 $r[e_i] = \{x_i \in I(e_i) \mid \exists x_j \in I(e_j)\ (1 \leq j \leq n,\ j \neq i)\ \text{such that}\ r(x_1, \ldots, x_i, \ldots, x_n)\}$,
that is, the elements in the extension of $e_i$ that are related to other elements by
relationship r.

4 $r\langle e_i \rangle = \{\langle x_i, \langle x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \rangle\rangle \mid \langle x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n \rangle \in I(r)\}$,
that is, the binary relation obtained from I(r) by considering entity $e_i$ and
the combination of values that $e_i$ is related to by relation r.
Intuitively, M is the domain of objects which we are trying to model, while V
is a set of values^5; each entity e is assigned by the structure a set of objects I(e)
as its extension; likewise, relationships are assigned relations over the entities
involved in the relationship, and attributes are considered functions relating
values to entities and relationships.

4.2  Extensions  of  the  Model 

It has become common to extend E-R models with ideas from object-oriented
models, like IS-A (class/subclass) relationships (see [14] for a developed (and
complex!) example of an object-oriented modeling framework). Our formal
language for E-R models does not have notation for IS-A relationships, or other
extensions. Our strategy is to start with a very simple E-R model, to incorporate
GQs into the model and use their power to model the extensions that are
needed to increase the modeling power of the initial model.

Definition  8.  Given  a  conceptual  structure  <  M,  V,  I  >,  an  extended  conceptual 
structure  is  a  tuple  <  Q,  M,  V,  I  >,  where  M,V,I  are  as  before  and  Q  is  a  set 
of  GQs  defined  over  M. 

Thus,  the  GQs  defined  may  have  as  arguments  entities  (which  are  represented 
by  sets  in  the  structure)  and  relationships  (which  are  represented  by  relations 
in  the  structure).  The  obvious  issue  is  to  find  a  set  of  GQs  that  will  be  helpful 
in  the  conceptual  modeling  task.  We  introduce  several  such  quantifiers  next.  In 
the  following  definitions,  it  will  be  assumed  implicitly  that  all  sets  and  relations 
come  from  M. 

First we note that a useful form of quantification is that of structural
(typeless) quantifiers. Since GQs can capture relationships among sets, they can
capture the semantics of IS-A relationships, which usually correspond to simple
set inclusion. We point out, however, that there is more information about
class/subclass relationships than mere inclusion. Advanced models classify IS-A
relationships along at least two dimensions: first, disjointness/overlap of the
subclasses (i.e. whether the subclasses are allowed to have any common elements;
this corresponds to inclusive and exclusive choices), and second, coverage of the
superclass by the subclasses (i.e. whether the superclass is the union of the
subclasses or not). These variants can all be captured by generalized quantification:

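$Q_I(A, A_1, \ldots, A_n) = \{\bigwedge_{i \in 1,\ldots,n} A_i \subseteq A\}$

$Q_{IC}(A, A_1, \ldots, A_n) = \{\bigwedge_{i \in 1,\ldots,n} A_i \subseteq A \land \bigcup_{i \in 1,\ldots,n} A_i = A\}$

$Q_{ID}(A, A_1, \ldots, A_n) = \{\bigwedge_{i \in 1,\ldots,n} A_i \subseteq A \land \bigwedge_{i,j \in 1,\ldots,n,\ i \neq j} A_i \cap A_j = \emptyset\}$

$Q_{IDC}(A, A_1, \ldots, A_n) = \{\bigwedge_{i \in 1,\ldots,n} A_i \subseteq A \land \bigwedge_{i,j \in 1,\ldots,n,\ i \neq j} A_i \cap A_j = \emptyset \land \bigcup_{i \in 1,\ldots,n} A_i = A\}$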

Clearly, $Q_I(A, A_1, \ldots, A_n)$ indicates that $A_1, \ldots, A_n$ are subclasses of $A$,
with no further restrictions; $Q_{IC}$ further constrains the relationship so that
$A_1, \ldots, A_n$ cover $A$ (that is, all the elements in $A$ are in one of the subclasses);

5  Separating  values  for  attributes  from  the  entities  makes  the  model  simpler  and  agrees 
with  standard  practice  in  building  (semantic)  data  models  ([11]). 
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QiD)  on  the  other  hand,  constraints  the  relationship  so  that  all  subclasses  are 
disjoint.  Finally,  Qwc  adds  both  constraints6. 
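On finite extensions, the four inheritance quantifiers can be checked directly. The following is a minimal Python sketch, not part of the formal development; the function names (`q_i`, `q_ic`, `q_id`, `q_idc`) and the sample classes are assumptions made for illustration only.

```python
from itertools import combinations

def q_i(a, subs):
    """Q_I: every A_i is a subset of A."""
    return all(s <= a for s in subs)

def q_ic(a, subs):
    """Q_IC: Q_I plus coverage (the union of the A_i equals A)."""
    return q_i(a, subs) and set().union(*subs) == a

def q_id(a, subs):
    """Q_ID: Q_I plus pairwise disjointness of the A_i."""
    return q_i(a, subs) and all(s.isdisjoint(t) for s, t in combinations(subs, 2))

def q_idc(a, subs):
    """Q_IDC: the subclasses are pairwise disjoint and cover A."""
    return q_ic(a, subs) and q_id(a, subs)

# Hypothetical data: persons classified as students and/or employees.
person = {"ann", "bob", "eve"}
student = {"ann", "bob"}
employee = {"bob", "eve"}

print(q_i(person, [student, employee]))   # plain subclassing
print(q_ic(person, [student, employee]))  # the subclasses also cover person
print(q_id(person, [student, employee]))  # fails here: "bob" is in both subclasses
```

In this sample the subclasses cover the superclass but overlap, so the covering variant holds while the disjoint variants do not.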

This  does  not  give  us  more  expressive  power  than  we  already  had  in  some 
extensions  of  E-R  models.  However,  it  is  easy  to  define  GQs  that  give  well-known 
properties  of  relations  which  are  not  expressible  in  E-R  models,  even  extended 
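ones. A few simple examples are:

$$Q_{R}(R) = \{ R \subseteq M^2 \mid R \text{ is reflexive} \}$$

$$Q_{T}(R) = \{ R \subseteq M^2 \mid R \text{ is transitive} \}$$

$$Q_{S}(R) = \{ R \subseteq M^2 \mid R \text{ is symmetric} \}$$

and their respective negations, $Q_{NR}$, $Q_{NT}$ and $Q_{NS}$⁷.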
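For a finite relation these properties, and their negations, are straightforward to test. The sketch below is illustrative only: relations are assumed to be Python sets of pairs, the parameter `domain` stands for $M$, and the ManagerOf data is hypothetical.

```python
def is_reflexive(rel, domain):
    """Every element of the domain is related to itself."""
    return all((x, x) in rel for x in domain)

def is_symmetric(rel):
    """Whenever (x, y) is in the relation, so is (y, x)."""
    return all((y, x) in rel for (x, y) in rel)

def is_transitive(rel):
    """Whenever (x, y) and (y, w) are in the relation, so is (x, w)."""
    return all((x, w) in rel for (x, y) in rel for (z, w) in rel if y == z)

# Hypothetical ManagerOf relation over three employees.
employees = {"ceo", "lead", "dev"}
manager_of = {("ceo", "lead"), ("lead", "dev")}

# The negated quantifiers hold exactly when the property fails.
q_nr = not is_reflexive(manager_of, employees)
q_ns = not is_symmetric(manager_of)
q_nt = not is_transitive(manager_of)
print(q_nr, q_ns, q_nt)
```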

The above GQs involve only one relation. More complex examples may involve
more than one relationship, and determine whether they can be composed
or not; the following is called pseudo-transitivity, for obvious reasons:

$$Q_{PT}(R_1, R_2, R_3) = \{ R_1 \subseteq M^2,\; R_2 \subseteq M^2,\; R_3 \subseteq M^2 \mid \forall x, y, z\;\; R_1(x,y) \wedge R_2(x,z) \rightarrow R_3(y,z) \}$$

The  following  is  called  compositionality,  as  it  determines  when  it  is  possible  in 
principle  to  compose  two  relations  with  a  common  entity: 

$$Q_{C}(A, B, C, R_1, R_2) = \{ A, B, C \subseteq M,\; R_1 \subseteq A \times B,\; R_2 \subseteq B \times C \mid R_1[B] = R_2[B] \}$$

The following is called functionality, as it states that two relations not only can
be composed, but furthermore that their composition is a function⁸:

$$Q_{F}(R_1, R_2) = \{ R_1, R_2 \subseteq M^2 \mid \forall x\, \forall y\, \exists! z\;\; R_1(x,y) \wedge R_2(y,z) \}$$
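The three multi-relation quantifiers can be sketched in the same style. The code below is an illustration under stated assumptions, not a definitive reading: relations are finite sets of pairs, `q_c` compares the projections of the two relations on the shared entity, and `q_f` interprets functionality as "the composition of the two relations is a function" (each element related by the first relation reaches exactly one element through the second), following the prose description. All function names and data are hypothetical.

```python
def q_pt(r1, r2, r3):
    """Pseudo-transitivity: R1(x, y) and R2(x, z) together imply R3(y, z)."""
    return all((y, z) in r3 for (x, y) in r1 for (x2, z) in r2 if x == x2)

def q_c(r1, r2):
    """Compositionality: R1 (a subset of A x B) and R2 (a subset of B x C)
    involve exactly the same elements of the common entity B."""
    return {b for (_, b) in r1} == {b for (b, _) in r2}

def q_f(r1, r2):
    """Functionality, read as: every x occurring in R1 reaches exactly one z
    through some y with R1(x, y) and R2(y, z)."""
    reached = {}
    for (x, y) in r1:
        reached.setdefault(x, set()).update(z for (y2, z) in r2 if y2 == y)
    return all(len(zs) == 1 for zs in reached.values())

# Hypothetical data: pairs are (class, teacher), (class, dept), (teacher, dept).
teaches = {("c1", "p1")}
offered_by = {("c1", "d1")}
faculty = {("p1", "d1")}
print(q_pt(teaches, offered_by, faculty))

# Hypothetical data: staff allocated to branches, branches linked to divisions.
allocated = {("s1", "b1"), ("s2", "b1")}
linked = {("b1", "d1")}
print(q_c(allocated, linked), q_f(allocated, linked))
```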

Finally, some GQs may involve relations and sets (relationships and entities)
and express properties that the relation determines on the set. As stated
above, any time there is a one-to-many relationship $R$ on entities $E_1, E_2$, the
entity on the many side (say $E_2$) has a partition created on it by $R$ (strictly
speaking, this only happens if participation of $E_2$ in $R$ is total; otherwise, the set of
elements not in $R[E_2]$ must also be considered). Different properties of such a partition
can be expressed by GQs⁹:

$$Q_{PC}(A, B, R) = \{ A, B \subseteq M,\; R \subseteq A \times B \mid \forall y, z \in B\; (\, |\{x \mid R(x,y)\}| = |\{x \mid R(x,z)\}| \,) \}$$
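As a small illustration, the cardinality condition can be tested by counting, for each element of $B$, how many elements of $A$ are related to it. This is a sketch under the assumption that $R$ is given as a finite set of pairs `(x, y)` with `x` in $A$ and `y` in $B$; the name `q_pc` and the client/representative data are hypothetical.

```python
from collections import Counter

def q_pc(b, rel):
    """All elements of B have the same number of R-related elements of A."""
    counts = Counter(y for (_, y) in rel)   # elements with no partner count as 0
    return len({counts[y] for y in b}) <= 1

# Hypothetical data: pairs are (client, representative).
representatives = {"r1", "r2"}
attends = {("c1", "r1"), ("c2", "r1"), ("c3", "r2"), ("c4", "r2")}

print(q_pc(representatives, attends))
print(q_pc(representatives, attends - {("c4", "r2")}))
```

The first call succeeds because each representative has two clients; removing one pair breaks the balance.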

We  show  the  applicability  of  these  GQs  by  using  them  to  solve  the  problems 
introduced  before. 

Example 5. Recall Example 1 about connection traps. $Q_{F}$(IsAllocated,
Operates) states that the relationships can be composed in a functional manner.
Thus, fan traps are avoided. $Q_{C}$(Branch, Staff, PropertyForRent, IsAllocated,
Oversees) states that the extensions of the relationships coincide on their common
entity. Thus, chasm traps are avoided. Note that $Q_{C}$ does not constrain
the relationships to be total or partial, leaving the analyst free to combine this
and other properties.

6  In the quantifier name, I stands for inheritance, C for cover and D for disjointness.

7  In the quantifier name, N stands for not, as in NR for not reflexive, and so on.

8  The notation $\exists! z$ is a shortcut for "there exists a unique $z$". Note that this condition
is less restrictive than asking that both $R_1$ and $R_2$ be functional.

9  $|A|$ denotes the cardinality of set $A$.

Example 6. Recall Example 2 about teachers, classes and departments. Then
$Q_{PT}$(Teaches, OfferedBy, Faculty) enforces the restriction that every professor
can only teach classes offered by the department where (s)he is faculty, as desired.

Example 7. Recall Example 3 about clients and representatives. Then
$Q_{P}$(Client, Representative, Attends) states that all representatives attend
the same number of clients.

Example 8. Recall Example 4 about the recursive relationship ManagerOf on
the entity employee. We can now express the desired properties of ManagerOf
as follows: $Q_{NR}$(ManagerOf), $Q_{NS}$(ManagerOf), $Q_{P}$(ManagerOf).

5  Related  Work 

The seminal paper ([4]) introduced Entity-Relationship modeling. Although the model is
widely used because of its balance of simplicity and expressive power, its limitations
have been noted for quite some time and have given rise to several
extensions: [2] proposed adding facilities to model the concept of transaction; [6]
proposed the addition of data types; [7] proposed the addition of more functionality
by describing entity behavior. The paper [8] adds several powerful concepts,
including abstract data types and arbitrarily complex structures, and defines a
powerful query language for E-R diagrams. It must be pointed out that not all
the cited work provides a formal foundation in the form of well-defined, formal
semantics ([8] is an exception). Our work is different in that we extend E-R diagrams
as little as possible, by introducing only one new category (that of GQ),
while at the same time capturing a rich class of semantic information which is of
interest for conceptual modeling (i.e., our goal is not to define a query language
or data types, but to assist the analyst in capturing more domain information,
thereby helping to restrict the possible interpretations of the model). By defining
a formal language and giving a formal semantics to a very basic E-R model, the
work presented here is independent of notational variations and extensions of
the model.


6  Conclusion  and  Further  Research 

This  paper  is  a  starting  point  for  work  that  uses  logical  methods  in  conceptual 
modeling.  It  was  argued  that  GQs  are  a  good  fit  for  conceptual  modeling  as 
they  are  declarative  and  high-level.  They  are  also  an  extremely  rich  and  powerful 
category;  the  challenge  is  to  define  relevant  sets  of  GQs  for  the  goal  at  hand. 
We note that, even though we have worked with a formal language for several
reasons, incorporating GQs into E-R diagrams is easy because GQs have an
intuitive graphic depiction, as was shown in [15]. Therefore, modelers could
actually work with a diagrammatic representation of the ideas introduced here.

Some issues that deserve further attention include reasoning with GQs. Some
sort of limited deduction may allow analysts to check properties of the model;
introductory work has already been carried out ([16], [17]), but it does not seem
to be well known outside logical circles. We hope the present work will help
disseminate potentially helpful work from pure logic to more applied enterprises.


Abstract. Hybrid knowledge representations that combine description logics with
logic programs are considered. Previous work combined description logics with
Horn logic programs. In this paper, the expressive power of such hybrid systems is
extended by allowing the combination of function-free, non-recursive stratified
logic programs with description logics. Two model-theoretic definitions for the
semantics of the hybrid knowledge representation are presented. It is shown that
the inference problem based on the second semantics is decidable. When the logic
program is Horn, the two semantics defined in this paper coincide with the semantics
in [4] with regard to the inference problem.


1.  Introduction 

The use of hybrid representations is an important research problem in knowledge
representation and reasoning. Several recent papers [2,4,5,6] addressed various
aspects of hybrid knowledge representations. In [6], a framework based on the
unification of constraint logic programming, annotated logic programming, and stable
model semantics was developed to handle the multiple modes of reasoning in hybrid
knowledge bases. The work in [2] investigated the connections between description
logics and predicate logics, and compared their relative expressiveness. Combinations
of description logics with Horn logic programs were investigated in [3,4]. In
particular, it was shown in [4] that the inference problem is decidable in a hybrid system
combining the description logic ALCNR with a non-recursive Horn logic program.

A natural extension in a similar direction is to consider the integration of
description logics with stratified logic programs for knowledge representation.
Stratified logic programs are extensions of Horn logic programs that allow a
restricted form of negation in the antecedents of program rules, and are thus more
expressive. The ability to combine stratified programs and description logics within
one hybrid system significantly enhances the system's expressive power.

In this paper, we present such an extension. The proposed hybrid system combines
the description logic ALCNR with a stratified logic program. We present two
model-theoretic semantics for the hybrid knowledge representation system: the
preferred model semantics and the preferred-canonical model semantics. We show that
the inference problem under the preferred-canonical model semantics is decidable,
based on a straightforward application of the decidability result in [4].

2.  Preliminaries 

In the new hybrid representation, a knowledge base $\Delta = (T, \Pi)$ consists of two
components: a terminology $T$ in the description logic ALCNR, and a logic program $\Pi$
which is function-free, non-recursive, and stratified. The concepts and roles from the
terminology $T$ can occur (positively) in the antecedents of rules in $\Pi$.

2.1.  The  Terminological  Component 

A description logic language contains a set of unary relations called concepts
that represent sets of objects in the domain of discourse, and binary relations called
roles that represent relationships between these objects. Composite formulas in a
description logic are built from primitive concepts and roles by using a set of
constructors in the logic. Here, as in [4], the description logic component in a hybrid
knowledge base can be any subset of the ALCNR language. Descriptions in ALCNR
are built in the following way: each primitive concept $A$ is a concept description; the
special concepts $\top$ (truth) and $\bot$ (falsity) are concept descriptions; if $C$ and $E$ are
concept descriptions and $R$ is a role description, then $C \sqcap E$, $C \sqcup E$, $\neg C$, $\exists R.C$,
$\forall R.C$, $(\geq n\, R)$ and $(\leq n\, R)$ are concept descriptions; each role description $R$ is of the
form $R_1 \sqcap \ldots \sqcap R_m$ where each $R_j$ is a primitive role.

The general terminological part of a terminology $T$ is a set of sentences in
ALCNR, where a sentence is either a concept definition, a concept inclusion, or a role
definition. A concept definition is of the form $C := E$ where $C$ is a concept name and
$E$ is a concept description. A concept inclusion is of the form $C \sqsubseteq E$ where both $C$
and $E$ are concept descriptions. A role definition is of the form $P := R$ where $P$ is a
role name and $R$ is a role description. We do allow recursive concept definitions.
The assertional part of $T$ is a set of ground atoms of the form $C(a)$, $R(a, b)$.

The meaning of a terminology $T$ is determined by a model-theoretic semantics.
We define an interpretation $I$ to be a non-empty domain $O$, a mapping from the set of
constants in $T$ to $O$ such that $a^I \neq b^I$ if $a \neq b$, a mapping from each concept name $C$
to a unary relation $C^I$ in $O$, and a mapping from each role name $R$ to a binary
relation $R^I \subseteq O \times O$. The mappings of $I$ can be extended to the composite
descriptions in a straightforward way.

An interpretation $I$ satisfies a concept instance $C(a)$ if $a^I \in C^I$. $I$ satisfies a
role instance $R(a, b)$ if $(a^I, b^I) \in R^I$. $I$ satisfies a concept definition $C := E$ if $C^I = E^I$,
it satisfies a concept inclusion $C \sqsubseteq E$ if $C^I \subseteq E^I$, and it satisfies a role definition
$P := R$ if $P^I = R^I$. $I$ is a model of a terminology $T$ if $I$ satisfies each sentence in $T$.
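For a finite interpretation, the extension of composite descriptions and the satisfaction of sentences can be computed by structural recursion. The following Python sketch is only an illustration of the semantics above: the tuple encoding of descriptions (`"and"`, `"or"`, `"not"`, `"some"`, `"all"`, `"atleast"`, `"atmost"`), the dictionary layout of the interpretation, and the function names are all assumptions, constants are interpreted as themselves, and role definitions are omitted for brevity. The sample data is one possible model of the terminology used in Example 1.

```python
def role_ext(role, interp):
    """Extension of a role description R_1 ⊓ ... ⊓ R_m (a tuple of role names)."""
    pairs = set(interp["roles"][role[0]])
    for name in role[1:]:
        pairs &= interp["roles"][name]
    return pairs

def ext(concept, interp):
    """Extension of a concept description in a finite interpretation."""
    dom = interp["domain"]
    if isinstance(concept, str):                     # concept name
        return set(interp["concepts"].get(concept, set()))
    op = concept[0]
    if op == "top":
        return set(dom)
    if op == "bot":
        return set()
    if op == "not":
        return set(dom) - ext(concept[1], interp)
    if op == "and":
        return ext(concept[1], interp) & ext(concept[2], interp)
    if op == "or":
        return ext(concept[1], interp) | ext(concept[2], interp)
    if op in ("some", "all"):                        # ∃R.C and ∀R.C
        pairs, c = role_ext(concept[1], interp), ext(concept[2], interp)
        fillers = {x: {y for (a, y) in pairs if a == x} for x in dom}
        if op == "some":
            return {x for x in dom if fillers[x] & c}
        return {x for x in dom if fillers[x] <= c}
    if op in ("atleast", "atmost"):                  # (≥ n R) and (≤ n R)
        n, pairs = concept[1], role_ext(concept[2], interp)
        count = {x: sum(1 for (a, _) in pairs if a == x) for x in dom}
        if op == "atleast":
            return {x for x in dom if count[x] >= n}
        return {x for x in dom if count[x] <= n}
    raise ValueError(f"unknown constructor: {op}")

def satisfies(sentence, interp):
    """Check one sentence: ("define", C, E), ("include", C, E) or ("instance", C, a)."""
    kind, left, right = sentence
    if kind == "define":
        return ext(left, interp) == ext(right, interp)
    if kind == "include":
        return ext(left, interp) <= ext(right, interp)
    if kind == "instance":
        return right in ext(left, interp)
    raise ValueError(f"unknown sentence kind: {kind}")

interp = {
    "domain": {"John", "v1"},
    "concepts": {"author": {"John"}, "student": {"John"},
                 "professor": set(), "paper": {"v1"}},
    "roles": {"write": {("John", "v1")}},
}
terminology = [
    ("define", "author", ("some", ("write",), "paper")),
    ("include", "author", ("or", "student", "professor")),
    ("include", ("and", "student", "professor"), ("bot",)),
    ("instance", "author", "John"),
]
is_model = all(satisfies(s, interp) for s in terminology)
print(is_model)
```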

2.2.  Stratified  Logic  Programs 

A function-free (normal) logic program is a set of program rules of the form

$$r:\;\; B_1(X_1) \wedge \ldots \wedge B_m(X_m) \wedge \neg C_1(Y_1) \wedge \ldots \wedge \neg C_n(Y_n) \rightarrow A(X). \qquad (1)$$

Here each $B_i$, $C_j$ and $A$ is an atom, $m, n \geq 0$. The conjunction on the left-hand side
of the arrow "$\rightarrow$" is called the antecedent (body) of the rule, and $A(X)$ is the
consequent (head) of the rule. $X$, $X_i$ and $Y_j$ are tuples of variables and constants. In this
study we consider only safe programs, i.e., each variable appearing in the head of $r$
or in a negative literal in the body of $r$ must occur in a positive literal in the body of
$r$. Moreover, we consider only non-recursive programs.

An interpretation $I$ for a program $\Pi$ consists of a non-empty set $D$ (the domain
of $I$), a mapping from the constants in $\Pi$ to $D$ such that $a^I \neq b^I$ if $a \neq b$, and a
mapping from each $n$-ary predicate $P$ to a subset of $D^n$. Here an $n$-tuple $\alpha \in P^I$ for the
$n$-ary predicate $P$ means that $I$ assigns $P(\alpha)$ to be "true". For a fixed domain $D$, one can
equivalently replace a program $\Pi$ by the (possibly infinite) set of ground rules which
are obtained from $\Pi$ by substituting the variables in each rule $r$ by the objects in $D$.
Let us call this instantiated program $\Pi_D$. An interpretation $I$ is said to be a model of
a program $\Pi$ if $I$ satisfies the instantiated program $\Pi_D$.

In logic programming, the class of stratified programs [1,8] has received a lot of
attention. A normal logic program $\Pi$ is said to be stratified if there is a stratification
$\Sigma$ which partitions the predicates in $\Pi$

$$\Sigma = S_1 \cup S_2 \cup \ldots \cup S_k$$

such that for each rule $r$ of the form (1) in $\Pi$, we have $\mathit{stratum}(B_i) \leq \mathit{stratum}(A)$ for
$1 \leq i \leq m$, and $\mathit{stratum}(C_j) < \mathit{stratum}(A)$ for $1 \leq j \leq n$. Here $\mathit{stratum}(q) = j$ if $q \in S_j$.
From a stratification $\Sigma$ of $\Pi$, we can equivalently define a partition of the rules in $\Pi$

$$\Pi = \Pi_1 \cup \Pi_2 \cup \ldots \cup \Pi_k$$

such that the rules with head atom in $S_j$ form the set $\Pi_j$.

Without loss of generality, we assume in this paper that the stratification
considered for a program $\Pi$ is the tightest stratification in the sense that for each atom $A$
in $S_j$ ($j \geq 2$), there exists an atom $q$ in $S_{j-1}$ such that $\neg q$ occurs in the body of a rule
with $A$ as the rule head. We remark that in case some concepts and roles from the
terminological part $T$ appear in the body of rules in the stratified program $\Pi$, these
concepts and roles will always belong to $S_1$ in the tightest stratification of $\Pi$. Note
that $\Pi_1$ may be empty for the tightest stratification (see Example 1).
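A stratification with the least possible stratum numbers can be computed by a simple fixpoint iteration over the two inequalities above. The sketch below is illustrative only: it assumes that "tightest" can be taken to mean the least stratum assignment satisfying those inequalities, it works at the level of predicate names, and the representation of a rule as a triple `(head, positive_body, negative_body)` is our own. The sample rules are those of Example 1.

```python
def tightest_stratification(rules, predicates):
    """Return the least stratum number for each predicate, or raise if none exists."""
    stratum = {p: 1 for p in predicates}
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            # Positive body predicates may share the head's stratum;
            # negated ones must lie strictly below it.
            need = max([stratum[p] for p in pos] + [stratum[q] + 1 for q in neg] + [1])
            if need > stratum[head]:
                if need > len(predicates):
                    # More strata than predicates can only arise from
                    # a cycle through negation.
                    raise ValueError("program is not stratified")
                stratum[head] = need
                changed = True
    return stratum

rules = [
    ("new-author", ["author"], ["previous-author"]),
    ("eligible", ["student", "write", "paper"], []),
    ("eligible", ["professor", "write", "paper", "new-author"], []),
]
predicates = {"student", "professor", "author", "paper", "write",
              "previous-author", "new-author", "eligible"}

stratum = tightest_stratification(rules, predicates)
print(sorted(p for p, s in stratum.items() if s == 1))
print(sorted(p for p, s in stratum.items() if s == 2))
```

On these rules the computed strata agree with the sets $S_1$ and $S_2$ given in Example 1.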

Let $\Pi$ be a stratified program with stratification $\Sigma = S_1 \cup S_2 \cup \ldots \cup S_k$. Let $M$,
$N$ be two models of $\Pi$ based on the same domain $D$. We say that $M$ is preferable
to $N$, written as $M \leq N$, if for each ground atom $P(\alpha) \in M - N$ ($P(\alpha)$ = true in
$M$, and $P(\alpha)$ = false in $N$), there is an atom $Q(\beta) \in N - M$ such that $\mathit{stratum}(Q) <
\mathit{stratum}(P)$. We write $M < N$ if $M \leq N$ and $M \neq N$. Clearly, for a stratified program
$\Pi$, the preference relation "$\leq$" is a partial order. A model $M$ is said to be a most
preferred model of $\Pi$ if there is no model $N$ of $\Pi$ such that $N < M$. Such a most
preferred model is called a perfect model by Przymusinski [7].
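The preference test between two models is easy to state in code. In this minimal sketch a model is assumed to be the set of ground atoms it makes true, each atom is a tuple whose first component is the predicate name, and `stratum` maps predicate names to stratum numbers; these representation choices and the one-rule sample program ($\neg q \rightarrow p$) are assumptions made for illustration.

```python
def preferable(m, n, stratum):
    """M <= N: every atom true only in M is matched by some atom true only in N
    whose predicate lies in a strictly lower stratum."""
    only_m, only_n = m - n, n - m
    return all(
        any(stratum[q[0]] < stratum[p[0]] for q in only_n)
        for p in only_m
    )

# Sample program with the single rule  not q -> p,  so q is in stratum 1 and p in stratum 2.
stratum = {"q": 1, "p": 2}
model_p = {("p",)}   # q false, p true
model_q = {("q",)}   # q true, p false

print(preferable(model_p, model_q, stratum))
print(preferable(model_q, model_p, stratum))
```

Both sets are models of the rule, but only the first is preferable to the other, which matches the intuition that $q$ should be minimized before $p$.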

In this paper, we always consider the perfect models of a stratified program to be
its designated models. Moreover, whenever we discuss the models of a program $\Pi$
with respect to a fixed domain $D$, we will always replace $\Pi$ by its instantiated
program $\Pi_D$, which is equivalent to a propositional logic program. It has been shown
that a stratified propositional program has a unique perfect model, which can be
obtained by iterating Horn program consequences and program reduction.
Let $\Pi$ be a propositional program with negations allowed in the body of rules.
Let $I$ be a partial interpretation of the form $I = (\mathit{Pos}, \mathit{Neg})$ where $\mathit{Pos}$ is the set of
atoms assigned to be true in $I$, $\mathit{Neg}$ is the set of atoms assigned to be false in $I$, and
$\mathit{Pos} \cap \mathit{Neg} = \emptyset$. The atoms occurring in $\Pi$ but not in $\mathit{Pos} \cup \mathit{Neg}$ are undefined in $I$. The
reduction of $\Pi$ w.r.t. $I$ is a program $\Pi / I$ which is obtained from $\Pi$ by the following
two operations:

(1)  For each rule of the form

$$B_1 \wedge \ldots \wedge B_m \wedge \neg C_1 \wedge \ldots \wedge \neg C_n \rightarrow A,$$

delete the rule if some $B_i \in \mathit{Neg}$ or $C_j \in \mathit{Pos}$.

(2)  For each remaining rule, remove from the body any occurrence of literals
whose atoms occur in $\mathit{Pos} \cup \mathit{Neg}$.
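The two operations translate directly into code. The following is a minimal sketch, assuming a ground rule is represented as a triple `(head, positive_body, negative_body)` of atoms; the function name `reduce_program` and the sample rules are hypothetical.

```python
def reduce_program(rules, pos, neg):
    """Reduce a ground program by the partial interpretation (pos, neg)."""
    decided = pos | neg
    reduced = []
    for head, body_pos, body_neg in rules:
        # Operation (1): delete rules whose body is already false.
        if any(b in neg for b in body_pos) or any(c in pos for c in body_neg):
            continue
        # Operation (2): remove body literals whose atoms are already decided.
        reduced.append((head,
                        [b for b in body_pos if b not in decided],
                        [c for c in body_neg if c not in decided]))
    return reduced

# Sample ground rules:  a and not b -> h,   c -> g,   d and not e -> f
rules = [("h", ["a"], ["b"]), ("g", ["c"], []), ("f", ["d"], ["e"])]
reduced = reduce_program(rules, pos={"a"}, neg={"b", "c"})
print(reduced)
```

Here the first rule becomes a fact for `h`, the second rule is deleted because `c` is false, and the third rule is untouched because `d` and `e` are undefined.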

Example 1. Consider the hybrid knowledge base $\Delta = (T, \Pi)$. The terminological
part $T$ has the following sentences:

$\text{author} := \exists\, \text{write}.\text{paper}$
$\text{author} \sqsubseteq \text{student} \sqcup \text{professor}$
$\text{student} \sqcap \text{professor} \sqsubseteq \bot$
$\text{author}(\text{John})$

The concepts student, professor, and paper are primitive ones, whereas the concept
author is a derived concept. The first inclusion states that authors are either professors or
students, and the second inclusion states that professors and students are disjoint.
The assertion "author(John)" implies that John has a filler of the role write.

The logic program $\Pi$ consists of the following rules:

(1)  $\text{author}(x) \wedge \neg \text{previous-author}(x) \rightarrow \text{new-author}(x)$

(2)  $\text{student}(x) \wedge \text{write}(x, y) \wedge \text{paper}(y) \rightarrow \text{eligible}(x)$

(3)  $\text{professor}(x) \wedge \text{write}(x, y) \wedge \text{paper}(y) \wedge \text{new-author}(x) \rightarrow \text{eligible}(x)$

Here in the tightest stratification of $\Pi$, $S_1$ = {student, professor, author, paper, write,
previous-author} and $S_2$ = {new-author, eligible}. Also note that in the partition of
the program $\Pi$, we have $\Pi_1 = \{\}$ and $\Pi_2 = \Pi$. Imagine that the program $\Pi$ formalizes
a professional conference organizer's rules to decide the eligibility of authors for
travel support. Intuitively, the rules say that student authors are eligible for travel
support, and that professor authors are also eligible if they have not previously
submitted papers to this conference series.

2.3.  Preferred  Models  of  a  Hybrid  Knowledge  Base

Recall that a hybrid knowledge base $\Delta$ consists of two components $T$ and $\Pi$,
where $T$ is a terminology and $\Pi$ is a stratified logic program. The concepts and roles
in $T$ are allowed to occur positively in the body of rules in $\Pi$. On the other hand, no
concepts or roles are allowed in the head of any rule in $\Pi$, because the terminology $T$
is supposed to specify a complete definition of the concepts and roles. The predicates
in $\Pi$ but not in $T$ are called ordinary predicates, which can be of any arity.

The meaning of a hybrid knowledge base $\Delta = (T, \Pi)$ is given by a model-theoretic
semantics. First, an interpretation $I$ for the hybrid knowledge base consists of a
nonempty domain $D$, a mapping from the constants in $T$ and $\Pi$ to $D$, and a mapping
from each predicate $P$ of arity $n$ (including the concepts, roles and ordinary
predicates) to $P^I$, which is a subset of $D^n$. Now, how do we define the models of a hybrid
knowledge base $\Delta$? A naive way would be to define a model of $\Delta$ as a model of both
of its components; namely, an interpretation $I$ is a model of $\Delta$ if the restriction of $I$
to the predicates in $T$ is a model of $T$ and the restriction of $I$ to the predicates in $\Pi$ is a
perfect model of $\Pi$. However, this has some undesirable consequences. Since
concepts and roles occur only positively in the rule antecedents, any perfect model of $\Pi$
(with respect to all predicates in it) will assign "false" to each ground instance of
such concepts and roles. However, this assignment may not be consistent with any of
the models of $T$. Moreover, the notion of minimization and perfect models should
really be applied to ordinary predicates only. Thus the desirable definition should
keep the extension of concepts and roles fixed (as in parallel circumscription)
according to the models of $T$, and then, based on these fixed truth assignments to concepts
and roles, define perfect models of $\Pi$ with respect to the ordinary predicates. Thus
we develop the following definition:

Definition 1 (Preferred Models).

Let $\Delta = (T, \Pi)$ be a hybrid knowledge base. Let $I = I_T \cup I_\Pi$ be an interpretation
of $\Delta$, where $I_T$ defines the mapping for concepts and roles, and $I_\Pi$ defines the
mapping for ordinary predicates. Let $D$ be the domain of $I$. We say that $I$ is a preferred
model of $\Delta$ if it satisfies the following two conditions:

(1)  $I_T$ is a model of $T$.

(2)  $I_\Pi$ is a perfect model of $\Pi_D / I_T$.

The collection of preferred models of $\Delta$ is denoted as
$\mathit{Pref}(\Delta) = \{ I : I \text{ is an interpretation and a preferred model of } \Delta \}$.
A ground atom $P(\alpha)$ is entailed by $\Delta$ (under this preferred model semantics),
written $\Delta \models_{\mathit{Pref}} P(\alpha)$, if $P(\alpha)$ is true in each model in $\mathit{Pref}(\Delta)$.

The reasoning problem under the preferred model semantics is precisely to
determine whether we have $\Delta \models_{\mathit{Pref}} P(\alpha)$, where $P$ is an ordinary predicate and $\alpha$ is a
tuple of constants. We do not have a decision procedure for this general problem
when the program $\Pi$ is not Horn. This is because the terminology $T$ may have
infinitely many models, each corresponding to a preferred model of $\Delta$, and finding
decision procedures for $\Delta$ in this case may be difficult. However, if we modify the
preferred model semantics and focus on an interesting, finite subset of models of $T$,
we will consider only finitely many models of $\Delta$ as designated ones. Hence the
inference problem under the modified semantics becomes decidable.

What are the models of $T$ that belong to the finite subset of interest in the
modified semantics? In [4], it was shown that one can always find a finite subset $\Omega$ of
the models of $T$ such that for each ground atom $P(\alpha)$ ($P$ is an ordinary predicate and
$\alpha$ is a tuple of constants), $T \cup \Pi \models P(\alpha)$ if and only if $I \cup \Pi \models P(\alpha)$ for each $I \in \Omega$.
We will focus on precisely these models of $T$ in defining the modified semantics.
According to [4], the set $\Omega$ can be obtained as follows. First, we build an initial
constraint system $S_T$ from $T$, which is equivalent to $T$ in the sense that it has the same
models as $T$. A constraint system is a non-empty set of constraints of the form $s : C$,
$s\,R\,t$, $\forall x.\, x : C$ and $s \neq t$, where $s$ and $t$ are either constants or variables, $C$ is a
concept description, and $R$ is a primitive role name. Second, we expand $S_T$ by repeatedly
applying a set of propagation rules. This will produce a finite set of completions,
which are constraint systems such that no propagation rule is applicable to them.
Some of the completions may contain a clash, i.e., a contradiction. For each of the
clash-free completions, a unique canonical model can be constructed. All such
canonical models of clash-free completions together form the set $\Omega$.

A naive application of the propagation rules in the expansion of $S_T$ may not
terminate because of the generating rules, which can introduce new variables into the
constraint system. Thus the $n$-tree equivalence condition was developed in [4] to ensure
that the expansion terminates in finitely many steps. In [4], each completion $S$ is obtained
from $S_T$ by application of propagation rules with the $U(\Delta)$-tree equivalence
termination condition. Here, for a knowledge base $\Delta = (T, \Pi)$ where $T$ is a terminology and
$\Pi$ is a Horn program, $U(\Delta)$ is the maximal size (number of literals in the antecedent)
of a rule derivable by chaining the rules in $\Pi$. In our case of non-Horn stratified
programs, $U(\Delta)$ is defined to be the maximal size of a rule derivable by chaining the
rules in $\Pi_1$, which is the first stratum of the program rules in $\Pi$. When $\Pi_1$ is empty,
$U(\Delta)$ is defined to be zero.

Now we are ready to present a modified preferred model semantics as follows:
instead of considering all models $I_T$ of a terminology $T$ when defining models of
$(T, \Pi)$, an interpretation $I$ is considered only when $I_T$ is the canonical model of a
clash-free completion $S$, which is obtained from $S_T$ based on the $U(\Delta)$-tree
equivalence condition. The formal definition follows:

Definition 2 (Preferred-Canonical Models).

Let $\Delta = (T, \Pi)$ be a hybrid knowledge base. Let $I = I_T \cup I_\Pi$ be an interpretation
of $\Delta$, where $I_T$ defines the mapping for concepts and roles in $T$, and $I_\Pi$ defines
the mapping for ordinary predicates. Let $D$ be the domain of $I$. We say that $I$ is a
preferred-canonical model of $\Delta$ if it satisfies the following two conditions:

(1)  $I_T$ is the canonical model of $S$, which is a clash-free completion obtained from
$S_T$ under the $U(\Delta)$-tree equivalence condition.

(2)  $I_\Pi$ is a perfect model of $\Pi_D / I_T$.

The collection of preferred-canonical models of $\Delta$ is denoted as
$\mathit{Pref\text{-}Cano}(\Delta) = \{ I : I \text{ is an interpretation and a preferred-canonical model of } \Delta \}$.
A ground atom $P(\alpha)$ is entailed by $\Delta$ (under this preferred-canonical model semantics),
written $\Delta \models_{\mathit{Pref\text{-}Cano}} P(\alpha)$, if $P(\alpha)$ is true in each model in $\mathit{Pref\text{-}Cano}(\Delta)$.

Note that when the program $\Pi$ is a Horn program, which is a special case of
stratified programs, the preferred model semantics and the preferred-canonical model
semantics coincide with respect to the entailment of ground atoms. Moreover, as far
as entailment of ground atoms is concerned, these semantics also coincide with the
semantics in [4] defined for the combination of an ALCNR terminology and a Horn
program. Here we use the notation $\Delta \models P(\alpha)$ to denote the entailment of $P(\alpha)$ by $\Delta$
under the semantics defined in [4].

Theorem 1.

Let $\Delta = (T, \Pi)$ be a hybrid knowledge base, where $T$ is a terminology in the
ALCNR logic and $\Pi$ is a non-recursive Horn program. Let $P(\alpha)$ be a ground atom
where $P$ is an ordinary predicate in $\Pi$. Then

$$\Delta \models_{\mathit{Pref}} P(\alpha) \;\Leftrightarrow\; \Delta \models_{\mathit{Pref\text{-}Cano}} P(\alpha) \;\Leftrightarrow\; \Delta \models P(\alpha).$$


3.  Decidable  Reasoning  with  Description  Logics  and  Stratified 
Logic  Programs 

From the discussion in Section 2, we see that for a hybrid system $\Delta = (T, \Pi)$,
the terminology $T$ has only finitely many canonical models $I_S$, which are constructed
from clash-free completions $S$ of $S_T$. Each such canonical model $I_S$ is finite. By
Definition 2, the preferred-canonical models of the hybrid system $\Delta = (T, \Pi)$ consider
precisely these canonical models $I_S$ as the designated models of $T$. Moreover, for
each such canonical model $I_S$, there is a unique perfect model $I_\Pi$ for the reduced
program $\Pi_D / I_S$. It follows immediately that there are finitely many preferred-canonical
models of $\Delta$, each of which is finite. Thus we get the following theorem by a straightforward
application of the decidability result from [4]:

Theorem 2.

Let $\Delta = (T, \Pi)$ be a hybrid knowledge base, where $T$ is a terminology in the
ALCNR logic and $\Pi$ is a stratified logic program. Let $P(\alpha)$ be a ground atom where
$P$ is an ordinary predicate in $\Pi$. Then the problem of determining whether
$\Delta \models_{\mathit{Pref\text{-}Cano}} P(\alpha)$ is decidable.

The following outlines the algorithm for the reasoning problem $\Delta \models_{\mathit{Pref\text{-}Cano}} P(\alpha)$.

(1)  Build the initial constraint system $S_T$ starting with all ground atoms in $T$.
Compute $U(\Delta)$ using the tightest stratification of $\Pi$.

(2)  Apply the propagation rules to $S_T$ and obtain clash-free completions $S$ with the
$U(\Delta)$-tree equivalence condition for termination. Let $W$ be the set of clash-free
completions obtained.

(3)  For each $S \in W$, construct the canonical model $I_S$ for $S$. Let $\Omega$ be the set of
such canonical models.

(4)  For each $I_S \in \Omega$, construct the unique perfect model $I_\Pi$ of the reduced program
$\Pi_D / I_S$ using the "Construct" algorithm below. Here $D$ is the domain of $I_S$
extended by the constants appearing in $\Pi$ but not in $T$.

(5)  If $P(\alpha)$ is true in every $I_\Pi$ obtained in step (4), then output the answer "yes";
otherwise output the answer "no".

Before we present the algorithm for constructing the unique perfect model $I_\Pi$
of $\Pi$ with respect to a model $I_S$ of the terminology $T$, we need to introduce some
notation. Recall that for a given stratified program $\Pi$, we can partition $\Pi = \Pi_1 \cup \Pi_2
\cup \ldots \cup \Pi_k$ according to its tightest stratification. For a non-empty set $D$ which is the
domain of an interpretation of $\Delta$, we define $U_j$ to be the set of ground atoms of the
form $A(\alpha)$ where $A$ is an ordinary predicate in stratum $j$, and $\alpha$ is a tuple of objects
in $D$. For a given interpretation $I_S$ of the terminology $T$, we use $\Pi'$ to denote the
reduced program $\Pi_D / I_S$. Clearly, $\Pi'$ can be represented as $\Pi' = \Pi'_1 \cup \Pi'_2 \cup \ldots \cup \Pi'_k$,
where each $\Pi'_j$ for $1 \leq j \leq k$ is the reduction of $(\Pi_j)_D$ by $I_S$.

The  "Construct"  algorithm  mentioned  above  is  the  following: 

Algorithm  Construct 

Input: $I_S$, the canonical model of a clash-free completion S, and $\Pi$, a stratified logic
program.

Output: $I_\Pi$, the unique perfect model of the reduced program $\Pi_D/I_S$.

(1) Let D be the domain of $I_S$ extended by adding all constants in $\Pi$ but not in T.
Let the program $\Pi'$ be the reduced program $\Pi_D/I_S$. Let
$Mod_0 = (\emptyset, \emptyset) = (P_0, N_0)$.

(2) For i = 1 to k do

(2.1) $Pos_i = Neg_i = \emptyset$.

(2.2) While there is a ground rule in $\pi'_i$ of the form
$B_1 \wedge B_2 \wedge \dots \wedge B_m \rightarrow A$,
such that A is not in $Pos_i$ and each $B_j$ is in $Pos_i$, add A to $Pos_i$.

(2.3) Let $Neg_i = U_i - Pos_i$ and $T_i = (Pos_i, Neg_i)$.

(2.4) Let $P_i = P_{i-1} \cup Pos_i$, $N_i = N_{i-1} \cup Neg_i$. Let
$Mod_i = (P_i, N_i) = Mod_{i-1} \cup T_i$.

(2.5) If i < k, then $\pi'_{i+1} = \pi'_{i+1}/Mod_i$.

(3) Output $I_\Pi = Mod_k$ as the result.

Example 2. Consider the hybrid knowledge base A in Example 1. Here U(A)
= $\emptyset$. This knowledge base has two preferred-canonical models $I_1$ and $I_2$, with the
same domain D = {John, $v_1$}. Both $I_1$ and $I_2$ contain the ground atoms
{author(John), write(John, $v_1$), paper($v_1$), eligible(John), ¬previous-author(John),
new-author(John)}. $I_1$ contains student(John) and ¬professor(John), while $I_2$
contains professor(John) and ¬student(John). Since both models contain eligible(John),
it follows that $A \models_{\text{pref-cano}}$ eligible(John).

To derive eligible(John), program rule (2) is used in building $I_1$, and program
rules (1) and (3) are used in constructing $I_2$. Note the use of negation as failure in
constructing $I_2$. The reduced program $\Pi'$ consists of two ground rules
"¬previous-author(John) → new-author(John)" and "new-author(John) → eligible(John)". In
applying the "Construct" algorithm, we get ¬previous-author(John) in $Mod_1$ by
negation as failure, and thus subsequently we get new-author(John) and
eligible(John) in $Mod_2$. We cannot infer eligible(John) if we do not use the perfect
model semantics which sanctions the negation as failure inference illustrated above.
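
The stratum-by-stratum construction and the negation-as-failure step of Example 2 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the paper: the `Rule` type, the function `construct` and the atom strings are hypothetical names, the program is assumed to be already grounded and reduced by the canonical model, and instead of rewriting the next stratum after each round the sketch checks negated body atoms directly against the atoms already known to be false.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A ground rule: head holds if all `pos` atoms are true and all `neg` atoms are false."""
    head: str
    pos: frozenset = frozenset()  # body atoms that must already be derived
    neg: frozenset = frozenset()  # body atoms that must be false in a lower stratum


def construct(strata, universes):
    """Return (true_atoms, false_atoms) after processing every stratum in order.

    strata[i] is the list of ground rules of one stratum and universes[i]
    is the matching set of ground atoms of that stratum.
    """
    true_atoms, false_atoms = set(), set()  # the initial model is empty
    for rules, universe in zip(strata, universes):
        pos_i = set()
        changed = True
        while changed:  # apply the rules of this stratum until nothing new is derived
            changed = False
            for rule in rules:
                if rule.head in pos_i:
                    continue
                positives_hold = rule.pos <= (true_atoms | pos_i)
                negatives_hold = rule.neg <= false_atoms
                if positives_hold and negatives_hold:
                    pos_i.add(rule.head)
                    changed = True
        neg_i = set(universe) - pos_i  # negation as failure: underived atoms are false
        true_atoms |= pos_i
        false_atoms |= neg_i
    return true_atoms, false_atoms


# The reduced program of Example 2, assuming previous-author sits in stratum 1
# and new-author and eligible sit in stratum 2.
prev, new, elig = "previous_author(John)", "new_author(John)", "eligible(John)"
strata = [
    [],  # stratum 1: no rule derives previous_author(John)
    [Rule(new, neg=frozenset({prev})), Rule(elig, pos=frozenset({new}))],
]
universes = [{prev}, {new, elig}]

true_atoms, false_atoms = construct(strata, universes)
print(sorted(true_atoms))   # ['eligible(John)', 'new_author(John)']
print(sorted(false_atoms))  # ['previous_author(John)']
```

Under these assumptions, a query atom would be answered "yes" only if it appears among the true atoms of the perfect model built for every canonical model.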

4.  Conclusions

In  this  paper,  we  present  a  new  hybrid  knowledge  representation  system  which 
allows  the  combination  of  description  logics  with  stratified  logic  programs.  Two 
semantics  are  defined  for  the  new  hybrid  knowledge  representation.  It  is  shown  that 
under  the  preferred-canonical  model  semantics,  the  inference  problem  is  decidable. 
Algorithms  for  performing  such  inferences  are  also  presented. 

The work reported here is still quite preliminary, and further studies are needed
to investigate the decidability of reasoning under the preferred model semantics when
terminological cycles are allowed.


Abstract. When a designer has the delicate task of integrating new or badly
specified needs, the task is not easy, especially within engineering applications that have
large class hierarchies and bulky object bases. The designer is only sure about the
specific, local changes to make, most explicitly on instances. We propose to this
kind of designer a tool that simulates class evolution according to the structural
evolution of instances. It has to provoke the emergence of new and adapted conceptual
abstractions and to detect their position in the class hierarchy. The objective of
this article is to analyze emergent abstractions by using metrics.


1.  Introduction 

A designer has a great number of strategies to manage the evolution of engineering
applications [6]. A designer can use, according to his needs, one or several strategies.
However, experience gained in OO systems and applications [10] has brought to light
that new needs appear most often while applications are being used, that is, while
instances are being manipulated.

In order to face unforeseen changes in complex and bulky engineering applications,
we propose to a designer a tool that simulates class evolution. This tool searches for and
brings out several possible directions of evolution of the specifications, driven by the dynamic
evolution of instance structure. The general principle is to provoke the emergence of
better adapted conceptual abstractions. This emergence is based on the newly
expressed requirements and on those already present in the database. Afterwards, the
simulation tool has to detect the location of these new abstractions in the hierarchy
and to determine possible impacts.


2.  The  object  evolution:  a  state  of  the  art 

In general, to prepare a system or an application to evolve, it is necessary to
be able:

1. to formulate changes in order to achieve the pursued goal, namely the model after
evolution;

2. to manage the impacts generated by these changes;

3. to define the link between the starting model and the resulting one (it is the same model,
but at two different stages).

OO systems and programming languages propose strategies and mechanisms
to manage evolution. Experience gained in OO design and development reveals, at
the design level as well as at the implementation level, a lack of genuinely evolutionary approaches.
Existing strategies are varied and only partially meet requirements. We propose to examine
these strategies according to different viewpoints, or facets, of evolution:


2.1  The  three  Facets  of  Evolution 

Rather than classifying evolutionary strategies according to common
but non-exhaustive criteria, we propose a classification based on criteria intrinsic to
evolution itself. We consider that the evolution of any OO system presents three facets:

□ Type of evolution: when needs are taken into account during the analysis and
design phases, the evolutionary strategy is preventive, or anticipated. When an
evolutionary strategy can face new or badly specified needs, the evolution is said
to be curative, or unanticipated.

□ Object of evolution: the evolution can concern the product (code, class,
schema...) or the process (a part of reasoning, a development process of an
application...).

□ Process of evolution: we distinguish two kinds of evolutionary processes:
development and emergence. Development concerns classes and their impacts
on the corresponding sub-classes and instances. Emergence concerns instance
evolution and its impacts on the corresponding classes.

Each facet is represented by an axis. The combination of the three axes gives a
three-dimensional representation of object evolution.


Fig.  1.  Three  facets  of  the  evolution  object 


In order to classify a strategy, we just have to answer these three questions:

1. What kind of evolution does this strategy propose (curative or preventive)?

2. What does it operate on (product or process)?

3. What kind of evolutionary process does it allow (development or emergence)?

By answering these questions for each studied strategy [9], we notice that most of
them propose a preventive evolution and work principally on the product,
essentially considering the development process. Our research work aims to complement
these preventive strategies. To that end, our model answers the three aforesaid
questions as follows: 1: curative - 2: product - 3: principally emergence.
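
The three questions amount to placing each strategy at one point of a three-axis space. A minimal sketch of that classification as a data structure is given below; the class and field names are hypothetical and chosen only for illustration, and the two entries are filled in from the answers stated in the text.

```python
from dataclasses import dataclass
from enum import Enum


class EvolutionType(Enum):
    PREVENTIVE = "preventive"  # anticipated during analysis and design
    CURATIVE = "curative"      # faces new or badly specified needs


class EvolutionObject(Enum):
    PRODUCT = "product"
    PROCESS = "process"


class EvolutionProcess(Enum):
    DEVELOPMENT = "development"  # class changes impacting instances
    EMERGENCE = "emergence"      # instance changes impacting classes


@dataclass(frozen=True)
class Classification:
    """Answers to the three questions for one evolutionary strategy."""
    kind: EvolutionType
    target: EvolutionObject
    process: EvolutionProcess


# The proposed model: curative, product, principally emergence.
proposed_model = Classification(
    EvolutionType.CURATIVE, EvolutionObject.PRODUCT, EvolutionProcess.EMERGENCE
)
# Categorization [7]: emergence, but in a preventive way, on the product.
categorization = Classification(
    EvolutionType.PREVENTIVE, EvolutionObject.PRODUCT, EvolutionProcess.EMERGENCE
)

print(proposed_model.kind.value, proposed_model.process.value)  # curative emergence
```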


2.2  Object  Evolution  problematic  under  the  product  viewpoint 


We restrict Fig. 1 to a two-dimensional figure because we only consider the product. For
reasons of space, we only place the evolutionary strategies on Fig. 2 in order
to position them with respect to the object evolution problem. Fig. 2 shows that most
OO evolutionary strategies are preventive and allow development processes. Only
categorization [7] allows emergence, but in a preventive way and with a break in the life
cycle. We note that the principal shortcoming of the existing evolutionary strategies is their
inability to cope with unexpected or poorly specified needs and incomplete data.
Moreover, instance evolution is always limited by that of the class. This situation constitutes
a restrictive and unnatural aspect of their evolution. Our model deals with this aspect,
principally through emergence.


Fig. 2. Principal strategies according to the process and the type of evolution


3.  The  proposed  model 

Because instances are the object representatives of real entities, we consider them
as full individuals. This leads de facto to an analogy with living individuals,
principally in the way they evolve and adapt according to their environment.
Since we use some principles and concepts of Artificial Evolution, we give a brief
presentation of Artificial Life and Genetic Algorithms. For more details, see [9].


3.1  Artificial  Evolution:  our  inspiration  source

Even though they have been defined and are used in different scientific areas, Artificial
Life [4] and Genetic Algorithms [3] took their inspiration from biology in order to
simulate evolutionary biological mechanisms. From evolutions and mutations, new,
well-adapted pieces of information arise. We have been attracted by this principle.
Artificial Life uses the concepts of GTYPE and PTYPE (by analogy with the genotype and
phenotype of biology). They evolve by unceasingly interacting through development
and emergence processes. Genetic Algorithms [3][5] inspire us through the way they
operate and the operators they use.


3.2  Concepts 

1. Basic concepts: population, Instance-PTYPE and Class-GTYPE:

□ Population and Genetic Patrimony: a group of classes representing various
abstractions of one and the same entity forms a population (like the population of
members of a university). All the attributes constitute its Genetic Patrimony.

□ Instance-PTYPE: instances are the phenotype and represent entities called upon to
evolve.

□ Class-GTYPE¹: classes define the features of instances, their genetic code.

We present an example that will be used and developed throughout the article:


Fig.  3.  Members  of  a  university  described  at  the  class  level 

2. Advanced concepts: Fundamental, Inherited and Specific Genotypes: in a
class, not every gene plays the same role or has the same prevalence. We consider
that any class is entirely specified through three types of genotypes:

□ Fundamental Genotype or FG: any object presents fundamental features,
represented by particular genes that carry the minimal semantics inherent to all
classes of a same population.

□ Inherited Genotype or IG: properties inherited by a class from its super-class
constitute the Inherited Genotype.

□ Specific Genotype or SG: it consists of properties locally defined within a class,
specific to it.


¹ In order to simplify, we will use the classical term class (respectively, instance) in place of
Class-GTYPE (respectively, Instance-PTYPE).

3. The scheme: the scheme expresses, in a simple and concise way, the genes in the
form of attributes and methods. It has the same genetic structure as the represented
entity. Each gene is represented by 0, 1 or #: 0 for the absence of the gene, 1 for its
presence and # when its presence is indifferent. The scheme is a simple and powerful means to
model groups of individuals. We consider two kinds of schemes: the permanent
scheme, associated with each specified class and having the same structure, and the
temporary scheme, which is a selection unit of one entity or a group of entities (instances
or classes).
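
As an illustration of the 0/1/# encoding, the following sketch builds a temporary scheme from the attributes of an instance and tests it against a scheme containing indifferent genes. It rests on assumptions that the text does not state: a scheme is taken to be a string with one character per gene, the genes follow a fixed order given by the genetic patrimony, and the attribute names and function names are hypothetical.

```python
def extract_scheme(patrimony, attributes):
    """Build a temporary scheme: '1' if the gene is present, '0' if it is absent."""
    present = set(attributes)
    return "".join("1" if gene in present else "0" for gene in patrimony)


def matches(scheme, pattern):
    """True if a 0/1 scheme fits a pattern in which '#' marks an indifferent gene."""
    if len(scheme) != len(pattern):
        raise ValueError("scheme and pattern must have the same length")
    return all(p == "#" or p == s for s, p in zip(scheme, pattern))


# Hypothetical genetic patrimony of a small population.
patrimony = ["name", "address", "salary", "thesis_topic"]

scheme = extract_scheme(patrimony, {"name", "address", "thesis_topic"})
print(scheme)                   # 1101
print(matches(scheme, "11#1"))  # True: the third gene is indifferent
print(matches(scheme, "1110"))  # False: the third and fourth genes differ
```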


The model of the example presented in Fig. 3 is described in Fig. 4, taking into account the
aforesaid concepts.


Fig. 4. The same example in our model


3.3  Evolutionary  Processes 

An evolutionary process is triggered when a change, anticipated or not, appears in
the model. The process must be able to detect this change, find the entities involved in
the evolution and reflect the change adequately:

1. Phases: we consider that an instance's evolutionary process is carried out in three
phases: an extraction phase, an exploration phase, and finally an exploitation
phase:

□ Extraction phase: extracts the object's genetic code into a temporary scheme.

□ Exploration phase: explores all classes to locate those that are adapted, even partially. It first
selects the set of populations concerned, then carries out the search in that set.
Selection is the operator used, based on the calculation of the adaptation values
Avs (see item 2).

□ Exploitation phase: manages the impacts by way of development or emergence. The
development process represents the impact of class evolution on instances, while
the emergence processes concern any emergence of new conceptual information,
by way of impacts on classes. There are two possible outcomes: local emergence,
related to the emergence of new information within existing class(es), where the genetic
code of the object has mutated and this can force a mutation in its class; and
global emergence, related to the emergence of a new conceptual entity.

2. Object operators: it is necessary to define basic operators to handle instances and
classes. The two most important are selection and crossing-over:

□ Selection: determines, after the structural evolution of an instance, which
class holds part or all of its specification.

□ Crossing-over: works on two entities via their schemes, interchanging their genes to
define a new group of genes. It constitutes the core of the emergence process (taken
from [8]). It amounts to giving each parent a weight for the transmission of genes
to children. We add a significant constraint: a permanent scheme presents at
most two significant blocks (after the FG): the IG's genes and the SG's genes. When performing
the crossing-over, these blocks must be respected. This is the constraint of blocks of
genes. The crossing-over is guided by the block constraints in order to ensure a
minimal coherence of the emergent schemes.

□ Adaptation value Av: calculates the semantic distance between the evolved object
and the classes. Denoting the evolved object's scheme by $Sch_{obj}$ and the close class's
scheme by $Sch_{param}$, the adaptive function is defined, using the operator $\wedge$
(logical and), as
$Av(Sch_{obj}) = \frac{1}{n} \sum_{i=1}^{n} \left( Sch_{obj}[i] \wedge Sch_{param}[i] \right)$,
where n is the number of genes specified in the evolved object and i is the index, running
from 1 to n, that gives at each stage the position of the two respective genes of the analyzed schemes.

□ Semantic distance sd: the value that expresses the semantic proximity
between an emergent scheme and one of its ancestors. It helps to choose the
super-class of the new abstraction. We use the same adaptive function as for
the calculation of the Avs.
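
The adaptive function can be sketched as follows. The sketch makes several assumptions that go beyond the text: schemes are equal-length strings over '0' and '1', a gene counts as specified in the evolved object when its character is '1', and the class names and schemes in the usage example are hypothetical. How the indifferent symbol '#' should enter the sum is not stated, so it is not handled here.

```python
def adaptation_value(sch_obj, sch_class):
    """Share of the genes specified in the evolved object that the class also holds."""
    if len(sch_obj) != len(sch_class):
        raise ValueError("both schemes must have the same length")
    n = sch_obj.count("1")  # assumption: a specified gene is one marked '1'
    if n == 0:
        raise ValueError("the evolved object specifies no gene")
    shared = sum(1 for o, c in zip(sch_obj, sch_class) if o == "1" and c == "1")
    return shared / n


# Hypothetical permanent schemes of three classes of one population.
classes = {"ClassA": "1111", "ClassB": "1100", "ClassC": "0010"}
evolved_object = "1101"

values = {name: adaptation_value(evolved_object, sch) for name, sch in classes.items()}
best = max(values, key=values.get)
print(best, values[best])  # ClassA 1.0
```

With these assumptions a value of 1 means that the class holds every gene of the evolved object, which matches the notion of a completely adapted class; intermediate values correspond to partially adapted classes.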

3. Examples: we consider the following instances, which not only evolve in their initial
structure but also introduce new attributes, to become:

□ Exploration phase: we calculate the adaptation value Av of each temporary
scheme with each existing class:

Conclusions for each object:
O1: partially adapted classes: Teacher, Ater, Research-Student and Senior-Researcher. They are candidates for crossing-over.
O2: no adapted class.
O3: one completely adapted class: Senior-Researcher.
O4: three completely adapted classes: Temporary, Ater and Research-Student.
O5: partially adapted classes: Teacher, Ater, Research-Student and Senior-Researcher. They are candidates for crossing-over.


□  Exploitation  phase 

□ O1: crossing-over on the schemes of Teacher, Ater and Senior-Researcher. Avp is an
adaptation value which is weighted with the other Avs.

Genes absent from all schemes are ignored in the crossing-over since they have no significance.

Crossing-over ends because we have two emergent schemes: E4 and E5.


□ O2: global emergence by direct creation of the abstraction. Place of insertion? As
a sub-class of University-Member, because O2 has only the FG, which is common
to all other classes.

□ O3: local emergence of the new attribute 'Responsibility' in the Senior-Researcher
class, because it is the only completely adapted class.

□ O4: global emergence by direct creation of the abstraction. Place of insertion?
Three Avs are equal to 1. Among the three well-adapted classes, Temporary is
the most eligible because it is the representative abstraction of the temporary
subpopulation. But as it is an abstract class, O4 represents another abstraction of
temporary researchers. So it provokes the emergence of a new sub-class of
Temporary.

□ O5: crossing-over on the same population as O1. The crossing-over steps are the
same. The final step stops with the emergent scheme E5.


3.4  Discussion  on  emergence 

After the structural evolution of O1, ..., O5, the emergence processes make it possible to detect
the kind of emergence, the abstractions concerned, and also the location of changes
and of the insertion of new classes (there are more details in [9]). But sometimes we can only
conclude that the emergent scheme is an abstraction of the permanent sub-population,
without knowing its precise location in this sub-population. We propose to enrich the
emergence process with metrics, in order to control it and to choose more reliably.


4.  Model  of  metrics

We apply the GQM technique [1] in order to determine where metrics are useful in
the emergence process. We take our inspiration from class context metrics [2]. As we
want to analyze any emergent abstraction both internally and externally, we identify
two contexts for metrics:

Context | Question | Metric | Comments
Intra-Abstraction | The most significant? | S = difference between scheme and instance | nearest scheme to instance
Intra-Abstraction | The least contradictory? | C = number of contradictions between attributes and attribute blocks | weaker = fewer contradictions
Intra-Abstraction | The most coherent? | Ch = Σ simple attributes + Σ attribute contradictions + Σ blocks + Σ block contradictions | weaker = less incoherence
Inter-Abstraction | Parents? | P, = detection of super-class(es) | Sd
Inter-Abstraction | Contradiction with parents? | C = Σ contradictory attributes + Σ contradictory blocks | weaker = less contradiction
Inter-Abstraction | Coupling | C = preferences towards and from other classes |
Inter-Abstraction | Depth in hierarchy? | P„ = position inside the hierarchy |

By applying these metrics, we can better choose and apply the emergence results.


5.  Conclusion 

The first objective of our research work is to allow a designer to
apprehend and anticipate the future changes and requirements of complex and
bulky OO applications. This is possible by simulating several evolution paths,
expressing new requirements on instances. We have seen that several paths of
class evolution can emerge from structural instance evolution. In this paper, we
propose metrics in order to analyze and control what emerges, how it can change class
specifications, and the possible impacts. We propose two kinds of metrics:
intra-abstraction context metrics and inter-abstraction context metrics. Metrics are
not systematically applied if the designer specifies invariants to respect during
evolution and emergence. All this is done in order to offer a tool that simulates
application evolution and helps the designer with the evolution and maintenance of complex
applications.


Abstract. In this study, we designed a fuzzy logic system to examine the influence
of the demographic variables of age, blood type, gender and race on bacterial
infection rates using a medical database assembled over 17 months from
patients presenting to Albert Einstein Medical Center. The intelligent system
was created using 155 patients, randomly selected from the database, and consisted
of four input categories of demographic variables and four output categories
of bacterial infection (“streptococci”, “staphylococci”, “Escherichia coli”
and “non-E. coli gram negative rods”). The remaining 32 patients were used to
assess the program’s ability to correctly determine bacterial infection when provided
only with demographic data. Our intelligent system correctly assigned the
bacterial output group in 27 of these 32 patients, giving an overall correlation of
84.4%. These studies suggest that demographic variables are major factors influencing
bacterial infection. Such a system may, therefore, hold promise as a
diagnostic tool.


1  Introduction 

The  ability  of  physicians  to  diagnose  bacterial  infections  is  currently  dependent  on  the 
use  of  a  series  of  developed  algorithms  in  which  the  most  likely  etiological  agent  is 
determined  based  on  the  patient’s  symptoms,  previous  history  and  predisposing 
physiological  factors.  Improvement  of  this  diagnostic  tool  would  involve  identifying 
additional  variables,  which  might  serve  as  risk  factors  for  infection  by  a  particular 
bacterial  species  or  genus. 

Previous studies have indicated that the demographic variables of age and blood
type might act as predisposing agents in bacterial infection [1-3]. Indeed, advanced
age has been shown to be a risk factor for pneumococcal infection [1], and expression
of blood types A or AB appears to predispose individuals to tuberculosis [2] or cholera
[3] infection. In the case of tuberculosis infection, blood type expression may even be a
major risk factor; a study of the Inuit showed that the infection was three times more
common in individuals of blood types A and AB than in any other group [2]. Using
traditional statistical methods, these studies have been able to show a putative role for
individual demographic variables as risk factors in bacterial infections, but shed no
further light on how these variables might be involved in the course of bacterial
disease. The development of the fields of artificial intelligence (fuzzy logic) and
genetic algorithms now allows the creation of computer programs in which complex
associations between several variables can be “learned” and used to predict the outcome
of given situations [4,5]. Fuzzy logic programming has proven to be of particular use
for the development of models of biological and medical systems, since these
frequently include “shades of gray or maybe” interactions between several variables
which cannot be efficiently analyzed using traditional computerized statistical
methods [4,5].

The association between demographic factors and bacterial infections represents
a highly suitable system to model using fuzzy logic, as all variables are definable
using well-established parameters. A prospective investigation was therefore undertaken
to examine the association between blood type, age, gender and race and bacterial
infection rates, using a medical database obtained over a 17-month period from
187 patients presenting to the wards of Albert Einstein Medical Center. To investigate
how closely the variables of blood type, age, gender and race were associated with
bacterial infection, the fuzzy logic program generated from these patients’ data was
tested for the ability to correctly ascribe bacterial infection when given only the
demographic data of 32 randomly selected patients.


2  Methods 


2.1  Data  Collection 

Collection of data was governed by the rules of a collaborative, expedited IRB
agreement (HN-2092) made between Philadelphia University and Albert Einstein
Medical Center. Each patient provided written consent, permitting confidential use of
their data, and was admitted into the study if their infection resulted from a single,
identifiable pathogen. All bacteriology data was obtained courtesy of the Clinical
Laboratories of Albert Einstein Medical Center. Patients excluded from the study
were those with underlying disease predisposing to infection, pregnancy or mental
disability, and minors.


2.2  Medical  Database  and  Fuzzy  System 

The data set investigated consisted of a real medical database comprising 187 patients.
Patient data was randomly assigned to two categories: training data (155 patients)
and test data (32 patients; 8 patients within each of the four bacterial output groups).
The intelligent system was modeled on the 155-patient training data using four
input classes (demographic variables of age, blood type, gender and race) and four


Using  Intelligent  Systems  in  Predictions  of  the  Bacterial  Causative  Agent  351 

output classes (bacterial infections with the species “staphylococci” (S. aureus and S.
epidermidis), “streptococci” (S. pneumoniae, and groups B and D streptococci),
“Escherichia coli” and “non-E. coli gram negative rods” (species of Klebsiella, Serratia,
Bacteroides, Morganella, Prevotella, Pseudomonas and Proteus)). The four output and
input spaces were divided into several fuzzy subsets and assigned linguistic terms.
Decision surfaces for two inputs (age and blood type) and the output bacterial classes
are shown in Figure 1. Each region was then assigned a fuzzy membership function. A
triangular shape was selected with height 1 at the center of the region and 50% overlap
between neighboring sets (for the input parameters). Fuzzy sets for the output
parameters are shown in Figure 2. Fuzzy rules were generated, using “IF...
AND...THEN”, where “IF...AND” were generated from the input parameters and
“THEN” from the output parameters. The system generated 173 rules, 159 of which
received the highest count and were retained. A fuzzy inference engine was
executed and mapping made based on the 159 remaining rules using correlation-product
inference. Defuzzification of the data was based on the center of gravity method,
which was sensitive to all remaining rules. The 32 patients constituting the “test” set,
in that they had been previously unseen by the system, consisted of 8 patients
clinically defined as belonging to each of the four created output groups.
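
The membership functions and the defuzzification step described above can be illustrated with a short sketch. It is not the system used in the study: the number of fuzzy sets, the age range of 20 to 100, the numeric codes 1 to 4 for the output classes and all function names are assumptions made for illustration. It also assumes symmetric, equal-width, non-overlapping output triangles, in which case the center of gravity of the output sets scaled by their rule strengths reduces to a strength-weighted mean of their centers.

```python
def triangular(x, left, center, right):
    """Membership of x in a triangle that reaches height 1 at `center`."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)


def partition(lo, hi, count):
    """Evenly spaced triangles over [lo, hi]; each one extends to its neighbors' centers."""
    step = (hi - lo) / (count - 1)
    return [(lo + (i - 1) * step, lo + i * step, lo + (i + 1) * step)
            for i in range(count)]


def center_of_gravity(strengths, centers):
    """Strength-weighted mean of the output set centers."""
    total = sum(strengths)
    if total == 0:
        raise ValueError("no rule fired")
    return sum(w * c for w, c in zip(strengths, centers)) / total


# Hypothetical input partition: five fuzzy sets for age between 20 and 100.
age_sets = partition(20, 100, 5)
memberships = [triangular(65, *fuzzy_set) for fuzzy_set in age_sets]
print(memberships)  # [0.0, 0.0, 0.75, 0.25, 0.0]

# Hypothetical rule strengths for four output classes coded 1 to 4.
crisp = center_of_gravity([0.75, 0.25, 0.0, 0.0], [1, 2, 3, 4])
print(crisp)  # 1.25
```

With this partition, every age in the range belongs to at most two neighboring sets and its memberships sum to 1, which is one common reading of a 50% overlap between neighboring triangles.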


Fig. 1. Decision surfaces for two inputs (age and blood type) and the output (bacteria),
demonstrating correlation between blood type/patient’s age and the type of bacteria

Fig. 2. Fuzzy sets representing output bacterial classes. The four output bacterial classes of
staphylococci, streptococci, E. coli and non-E. coli gram negative rods were each described, in
the order shown, using a triangular shape with height 1 at the center of the region and no
overlap between classes


3  Results 


3.1  Age,  Blood  Type,  Gender  and  Race  Distribution  for  the  Four  Output  Classes 
of  Bacterial  Infection  within  the  Patient  Database 

The demographic and bacterial infection variables of the patients in the medical
database generated for this study are shown in Table 1. Of the 187 patients, 64 were
infected by staphylococcal species (34% of total infections). Patients infected with
staphylococcal species per se demonstrated more than a two-fold increase in the
frequency of blood type B (22%) in comparison with that normally observed in the
general population (10%), with a decrease in the frequency of type A (31% compared
to a normal frequency of 40%). Frequencies of types AB and O were similar to the
general population.

Table 1. Age, blood type, gender and race distribution for the four output classes of bacterial
infection. Frequency in each output class (%)

Input | Staphylococci (n = 64) | Streptococci (n = 40) | E. coli (n = 28) | Non-E. coli GNR (n = 55) | Total (n = 187)
Age | | | | |
20-40 | 6 (9%) | 5 (12%) | 3 (11%) | 8 (14%) | 22 (12%)
41-60 | 15 (23%) | 18 (45%) | 4 (14%) | 13 (24%) | 50 (27%)
61-80 | 33 (52%) | 10 (25%) | 16 (57%) | 23 (42%) | 82 (44%)
81-100 | 10 (16%) | 7 (18%) | 5 (18%) | 11 (20%) | 33 (17%)
Blood type | | | | |
A | 20 (31%) | 15 (38%) | 9 (32%) | 17 (31%) | 61 (33%)
AB | 2 (3%) | 4 (10%) | 1 (4%) | 3 (5%) | 10 (5%)
B | 14 (22%) | 9 (22%) | 2 (8%) | 5 (9%) | 30 (16%)
O | 28 (44%) | 12 (30%) | 16 (56%) | 30 (55%) | 86 (46%)
Gender | | | | |
Male | 38 (59%) | 27 (68%) | 12 (43%) | 31 (56%) | 108 (58%)
Female | 26 (41%) | 13 (37%) | 16 (57%) | 24 (44%) | 79 (42%)
Race | | | | |
African-American | 34 (53%) | 23 (58%) | 14 (50%) | 35 (64%) | 106 (57%)
Asian | 2 (3%) | 1 (2%) | 2 (7%) | 0 | 5 (3%)
Caucasian | 26 (41%) | 13 (33%) | 11 (39%) | 17 (31%) | 67 (36%)
Hispanic | 2 (3%) | 3 (7%) | 1 (4%) | 3 (5%) | 9 (4%)

Groups of bacterial species defining output classes used in the Table were defined according
to standardized microbiological classifications [6-8]

The  40  patients  who  were  infected  by  streptococcal  species  demonstrated 
more  than  a  two-fold  increase  in  the  frequency  of  both  blood  types  AB  (10%)  and  B 
(22%)  in  comparison  with  the  general  healthy  population  (5%  and  10%,  respectively) 
with  a  decrease  in  the  frequency  of  type  O  (30%  compared  with  45%).  The  frequency 
of  type  A  was  similar  to  that  expected  in  the  general  population  at  large. 

E. coli infections accounted for 28 of the 187 patients in the study (15% of
total infections). There was an increase in the frequency of type O in these patients of
10% compared with the expected frequency from the general population and a
commensurate decrease in frequency of type A.

Infections with non-E. coli GNR constituted 55 of the 187 patient admissions
(29%) and involved several species of bacteria: Klebsiella pneumoniae (n=14), Proteus
mirabilis (n=11), Bacteroides species (n=10), Serratia species (n=7), Enterobacter
cloacae (n=4), Pseudomonas species (n=4), Morganella morganii (n=3) and
Prevotella species (n=2). Interestingly, the blood group distribution for this
broad-spectrum category was similar to that for E. coli.
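
The comparisons above can be reproduced from the blood-type counts in Table 1. The sketch below divides each observed frequency by the general-population frequency quoted in the paper (40% A, 5% AB, 10% B, 45% O). The function name fold_change and the dictionary layout are illustrative assumptions, not part of the authors' system.

```python
from typing import Dict

# General healthy population frequencies (%), as quoted in the paper.
GENERAL_POPULATION: Dict[str, float] = {"A": 40.0, "AB": 5.0, "B": 10.0, "O": 45.0}

# Blood-type counts per output class, copied from Table 1.
TABLE_1_COUNTS: Dict[str, Dict[str, int]] = {
    "staphylococci": {"A": 20, "AB": 2, "B": 14, "O": 28},
    "streptococci": {"A": 15, "AB": 4, "B": 9, "O": 12},
    "E. coli": {"A": 9, "AB": 1, "B": 2, "O": 16},
    "non-E. coli GNR": {"A": 17, "AB": 3, "B": 5, "O": 30},
}


def fold_change(counts: Dict[str, int]) -> Dict[str, float]:
    """Observed frequency of each blood type divided by its general-population frequency."""
    total = sum(counts.values())
    return {
        blood_type: (100.0 * n / total) / GENERAL_POPULATION[blood_type]
        for blood_type, n in counts.items()
    }


if __name__ == "__main__":
    for output_class, counts in TABLE_1_COUNTS.items():
        ratios = fold_change(counts)
        summary = ", ".join(f"{bt}: {ratio:.2f}x" for bt, ratio in ratios.items())
        print(f"{output_class}: {summary}")
```

A ratio above 1 indicates a blood type that is over-represented in that output class relative to the general population; with groups this small, such ratios are descriptive only.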

In  all  output  groups  except  one,  the  gender  distribution  was  similar  to  that  of 
the  general  patient  population  with  slightly  more  males  than  females  (58%  compared 
with  42%  in  each  group).  Interestingly,  in  the  E.  coli  output  group,  the  reverse  was 
true  with  only  one-third  of  those  infected  being  male  and  two-thirds  female.  Three 
groups  also  showed  a  similar  age  distribution  (staphylococci,  E.  coli  and  non-E.  coli 
GNR),  with  the  majority  of  patients  being  over  60  (68,  75  and  62%,  respectively).  In 
contrast,  the  streptococcal  group  was  somewhat  younger  with  57%  of  patients  being 
60 or less. Race distribution was found to be fairly homogeneous across the four output
groups.


3.2  Intelligent  System  as  a  Predictor  of  Bacterial  Output  Class 

The novel fuzzy logic system generated from 155 of the 187 patients in the database was also
tested for its ability to correctly determine the bacterial infectious agent of the remaining
32 patients when provided solely with demographic data. The program was
able to correctly assign all patients with streptococcal infections (8/8), 7 out of 8 patients
with E. coli infections and 6 out of 8 patients in each of the staphylococcal and non-E. coli
gram negative rod (non-E. coli GNR) groups to their output groups. This
gave an overall prediction rate for the patient sample of 27/32 or 84.38%.

The 5 patients who were incorrectly classified by the program were placed
into the following output groups: 2 patients with non-E. coli GNR were assigned to the
E. coli category, and 2 patients with staphylococcal infections and 1 patient with E. coli
infection were assigned to the streptococcal category. This system is currently under
development and, as such, no clinical evaluation of its efficacy at diagnosing bacterial
infection in the absence of standard clinical algorithms has yet been performed.
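
The reported figures can be checked by arranging the 32 test patients in a confusion matrix. The following sketch uses only the counts stated above; the names CONFUSION, recall and overall_accuracy are illustrative.

```python
from typing import Dict

# Rows: true output class; columns: class assigned by the program.
# All counts are taken from the results reported in Section 3.2.
CONFUSION: Dict[str, Dict[str, int]] = {
    "staphylococci": {"staphylococci": 6, "streptococci": 2},
    "streptococci": {"streptococci": 8},
    "E. coli": {"E. coli": 7, "streptococci": 1},
    "non-E. coli GNR": {"non-E. coli GNR": 6, "E. coli": 2},
}


def recall(confusion: Dict[str, Dict[str, int]], true_class: str) -> float:
    """Fraction of patients of one true class that were assigned to that class."""
    row = confusion[true_class]
    return row.get(true_class, 0) / sum(row.values())


def overall_accuracy(confusion: Dict[str, Dict[str, int]]) -> float:
    """Fraction of all test patients assigned to their true class."""
    correct = sum(row.get(true_class, 0) for true_class, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


if __name__ == "__main__":
    for name in CONFUSION:
        print(f"{name}: {recall(CONFUSION, name):.2%}")
    print(f"overall: {overall_accuracy(CONFUSION):.2%}")
```

The matrix also makes the error pattern explicit: three of the five misclassifications fall into the streptococcal category and two into the E. coli category.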


4  Discussion 

These data suggest that, with training, bacterial infection may be predicted with relative
efficiency by inputting a patient’s data for blood type, gender, age and race into a
fuzzy logic program. Previous studies in the literature have
suggested a putative correlation between demographic variables and bacterial infection
[1-3], but only now, with the advent of this powerful analytical tool, can the
dynamics between these factors be interpreted.

By designing a fuzzy logic system, using a real medical database, we were
able to correctly diagnose 27 patients from the 32-patient test group using only the
demographic variables of age, blood type, gender and race, for four separate groups of
infectious agents, namely staphylococci, streptococci, Escherichia coli and non-E. coli
gram negative rods. Of these four output groups, the program was able to correctly
assign all patients with streptococcal infections (8/8), 7 out of 8 patients with E. coli
infections and 6 out of 8 patients in each of the staphylococcal and non-E. coli gram
negative rod (non-E. coli GNR) groups to their output groups. This gave an overall
prediction rate for the patient sample of 27/32 or 84.38%. The 5 patients who were
incorrectly classified by the program were placed into the following output groups: 2
patients with non-E. coli GNR were assigned to the E. coli category, and 2 patients with
staphylococcal infections and 1 patient with E. coli infection were assigned to the
streptococcal category.

This finding is particularly impressive since staphylococci and streptococci
(output groups 1 and 2) have a number of features in common [9,10]. Both organisms
inhabit the same microbial niches, being chiefly residents of the skin and mucous
membranes, have a similar structure to their outer cell wall (gram-positive), produce
closely similar virulence factors for invasion and produce similar spectra of disease
[9,10]. As a result, they are often difficult for clinicians to distinguish based on clinical
algorithms alone and in the absence of microbiological laboratory data. Interestingly,
the remaining two output groups, namely the E. coli and the non-E. coli GNR categories,
also share a number of features. Both E. coli and non-E. coli GNR bacteria inhabit
similar microbial niches, being chiefly residents of the gastrointestinal tracts of
humans and animals, are gram negative organisms and produce similar spectra of
disease [11]. The finding that E. coli was incorrectly assigned in two individuals is
therefore not a surprising one since, as well as sharing the habitats of these opportunists,
E. coli frequently behaves as one in individuals whose immune systems are compromised
due to surgery, tracheostomy, catheterization or renal dialysis [11].

Examination of the distribution of the demographic variables across the four
output groups (Table 1) demonstrated that there were subtle differences in age, blood
type and gender between the four output groups, which had clearly allowed the intelligent
system to differentiate between them. Of these variables, blood type distribution
was the most significantly different within the patient population and between the four
output groups (Table 1). In the general, healthy human population, regardless of racial
origin, approximately 40% of individuals will be of blood type A, 5% of type AB,
10% of type B and 45% of type O [12]. In contrast, the hospitalized population
demonstrated a decrease in expected frequency of blood type A (40% to 33%) and an
increase in frequency of blood type B (10% to 16%), with types AB and O having values
as expected (Table 1).

Differential distributions of blood type frequency were also observed when
the output classes of staphylococcal and streptococcal infection were compared (Table
1). These infections constituted the majority of cases in the medical database
(104/187; 56%) and thus would be expected to provide the most permutations in terms
of demographic variable combinations. In addition, the groups were microbiologically
similar, since both are gram-positive organisms, with similar demographic
variable distributions of gender and race. Only one variable, apart from blood type,
was different between the two classes: the streptococcal-infected patients represented
a somewhat younger group, being mostly below the age of 60 (Table 1). Both groups
demonstrated an increased frequency of blood type B (22%) above that expected for
both the patient population (16%) and the general healthy population (10%) and a
decrease in frequency of blood type A commensurate with that of the general patient
population (33%; Table 1). Staphylococci-infected patients had similar frequencies of
blood types AB (3%) and O (44%) to both the general patient and normal healthy populations.
In contrast, patients with streptococcal infections demonstrated an increased frequency of
individuals with type AB (10%) and a decrease in the frequency of type O (30%;
Table 1).

Patients with E. coli and non-E. coli GNR infections were also found to have
a differential blood type distribution from the general patient population (Table 1).
The organisms from these two output groups shared microbiological features in common
but differed demographically since the E. coli group were significantly older than
the general patient population distribution (71% over 60 years) and were predominantly
female (67%) (Table 1). In spite of differences in other demographic variables,
both groups showed a similar blood type distribution with an increased frequency of
blood type O, when compared with both the general healthy population and the patient
population as a whole (53-55% compared with 45%; Table 1).

Bacterial infection is a highly selective and dynamic process in which the
host is targeted based on a complex interplay of many factors, which are only now
gradually beginning to be understood. The results of this study suggest that the demographic
variables of blood type, gender, age and race may be involved in bacterial host
selection, and that such differential targeting may allow us to use these variables as a
predictor of disease. In this study, the size of our population did not allow us to establish
which variables were the most strongly associated with the bacterial infectious
agent. An increased patient bank of data, which is currently being generated courtesy
of an Einstein Society Award from Albert Einstein Medical Center, would allow further
tuning and development of the program to eliminate any variables that prove to
be redundant. More patient data will also allow the subdivision of the current bacterial
infection categories to single species, making the program more specific. Indeed, it is
anticipated that the combination of currently used clinical algorithms with a user-friendly,
simplified version of the current program might allow its eventual use by all
physicians to make more accurate initial predictions of the bacterial causative agent of
an infection.


Abstract. A learned lesson, in the context of a pre-defined organizational
process, summarizes an experience that should be used to modify that process,
under the conditions for which that lesson applies. To promote lesson reuse,
many organizations employ lessons learned processes, which define how to
collect, validate, store, and disseminate lessons among their personnel, typically
by using a standalone retrieval tool. However, these processes are problematic:
they do not address lesson reuse effectively. We demonstrate how reuse can be
facilitated through a representation that highlights reuse conditions (and other
features) in the context of lessons learned systems embedded in targeted
decision-making processes. We describe a case-based reasoning
implementation of this concept for a decision support tool and detail an
example.


1  Lessons  Learned  Process 

Lessons  learned  (LL)  processes  (Weber  et  al.,  2000b)  are  knowledge  management 
(KM)  solutions  for  sharing  and  reusing  knowledge  gained  through  experience  (i.e., 
lessons)  among  an  organization’s  members.  LL  systems  are  motivated  by  the  need  to 
preserve  an  organization’s  knowledge  and  convert  individual  knowledge  into 
organizational knowledge so that, when experts become unavailable, other employees
who  encounter  conditions  that  closely  match  some  lesson’s  context  may  benefit  from 
applying  it.  Therefore,  a  lesson  learned  is  a  validated  working  experience  that,  when 
applied,  can  positively  impact  an  organization’s  processes.  While  some  organizations 
can  quickly  update  the  processes  targeted  by  lessons,  thus  eliminating  the  need  for  a 
repository  of  lessons,  other  organizations  (e.g.,  the  US  military,  the  Department  of 
Energy)  do  not  have  this  luxury  (i.e.,  they  cannot  easily  update  their  processes),  which 
necessitates  using  LL  systems  to  explicitly  store  and  retrieve  lessons. 

LL systems are ubiquitous; we easily located¹ over 40 of them on the WWW, are
aware that many others are used in private industry, and discovered that they rarely
succeed in promoting knowledge reuse/sharing for two reasons (Weber et al., 2000b).
First, the selected representations of lessons typically are not designed to facilitate
reuse, because they do not clearly identify the process to which the lesson
applies, its contribution to that process, or its pre-conditions for application. Second,
these systems are usually not integrated into an organization’s decision-making
process, which is the primary requirement for any solution to successfully contribute
to KM activities (Reimer, 1998; Leake et al., 1999; Aha, 1999).


¹ Our compiled findings are posted at www.aic.nrl.navy.mil/~aha/lessons.

Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 358-367, 2000.

KM  solutions  usually  involve  both  organizational  dynamics  and  technological 
components.  We  propose  a  technological  solution  to  designing  LL  systems  that 
includes  a  lesson  representation  chosen  to  potentiate  knowledge  sharing  in  an 
embedded  system  in  which  lessons  are  proactively  brought  to  the  attention  of  users.  In 
the  remainder  of  this  paper  we  summarize  research  on  LL  systems,  introduce  a 
representation  for  lessons  that  promotes  knowledge  sharing,  discuss  the  lessons 
learned  process,  describe  the  design  of  an  active  lessons  delivery  system,  and  detail  an 
example of its use as a module in HICAP² (Munoz-Avila et al., 1999), a decision
support  tool  for  interactive  plan  authoring. 


2  Related  Work 

Although  dozens  of  lessons  learned  centers  and  their  respective  systems  exist,  few 
researchers have addressed LL systems, and almost none in artificial intelligence.³
This  is  somewhat  surprising,  given  that  their  developers  and  users  overwhelmingly 
agree  that  current  LL  systems  are  insufficient.  That  is,  there  are  several  unanswered 
research  issues  regarding  intelligent  LL  systems  that  need  to  be  addressed. 

Several  KM  publications  have  reported  on  issues  related  to  lessons  learned 
systems (van Heijst et al., 1996; O'Leary, 1998; Secchi, 1999; Habbel et al., 1999;
SELLS, 1999). However, few of these discussed topics related to intelligent systems
(e.g.,  van  Heijst  et  al.  (1997)  stress  the  relationship  between  case-based  reasoning 
(CBR)  and  LL  systems).  The  only  deployed  application  that  uses  CBR  technology  is 
NASA’s  RECALL  system  (Sary  &  Mackey,  1995),  although  three  research  groups 
have  recently  proposed  CBR  approaches  that  promote  knowledge  sharing. 

First,  the  Air  Campaign  Planning  Advisor  (ACPA)  (Johnson  et  al.,  2000) 
disseminates  videotaped  stories  (e.g.,  best  practices)  in  a  planning  environment. 
However,  ACPA  does  not  reason  on  the  stories,  nor  highlight  reuse  components  or 
conditions.  Thus,  the  user  must  decide  whether  or  not  to  apply  the  memory  captured 
in  the  story  according  to  their  interpretation  of  it. 

Second,  CALVIN  (Leake  et  al.,  2000)  captures  lessons  concerning  which  online 
information  resources  should  be  searched  for  a  given  research  topic.  The  subject  and 
research  results  are  used  to  index  lessons  so  that  when  a  user  starts  a  search, 
previously  stored  results  are  proactively  brought  to  the  user’s  attention.  Unlike  most 
LL  systems,  CALVIN  is  task-specific  rather  than  organization-specific. 

Finally, we propose the Active Lessons Delivery System (ALDS), whose
implementation in HICAP is discussed and exemplified in this paper. Users can
interact with HICAP to author plans by iteratively decomposing complex tasks into
primitive actions. ALDS monitors changes in the plan and plan state (which is described
by a set of <question, answer> pairs), and triggers a lesson when its applicable task
matches a task in the (evolving) plan and its conditions closely match the plan state.
ALDS differs from the previous two embedded architectures in that (1) it focuses
specifically on organizational lessons in the context of planning tasks, (2) it
automatically determines a triggered lesson’s interpretation for the evolving plan, and
(3) it allows users to automatically implement a lesson by pressing a button.


² For more information and demonstrations of HICAP and ALDS, both developed in Java 1.2,
please see http://www.aic.nrl.navy.mil/hicap.

³ This motivated us to organize the AAAI’00 Intelligent Lessons Learned Workshop, whose
homepage is www.aic.nrl.navy.mil/AAAI00-ILLS-Workshop.


3  Lessons  Learned  Knowledge  Representation 

In  our  survey  of  LL  systems  (Weber  et  al.,  2000b),  we  found  that  lessons  are  often 
represented  inadequately,  preventing  them  from  being  easily  reused  or  understood. 
For  example,  recorded  lessons  often  do  not  highlight  the  task  for  which  they  apply,  or 
precisely  specify  their  triggering  conditions.  Also,  free  text  representations,  which  are 
used  in  all  the  deployed  LL  systems  we  have  found,  complicate  reuse  because  this 
text  has  to  be  correctly  interpreted  to  ensure  proper  lesson  reuse. 

A  lesson  is  derived  from  an  experience  in  which  the  result  derived  from  applying 
an  originating  action  yields  significant  new  knowledge  (i.e.,  a  contribution),  due  to  a 
success  or  failure,  that  can,  and  should,  be  taught  to  others.  A  lesson’s  conditions  for 
reuse  are  the  relevant  state  variables  that  existed  when  the  originating  action  occurred. 
An  ideal,  validated  lesson  facilitates  its  dissemination  by  clearly  stating  its 
contribution and the decision, task, or process⁴ for which, by applying its
recommended  response  action  (i.e.,  a  suggestion),  a  user  can  reduce  or  eliminate  the 
potential  for  failures  or  mishaps,  or  reinforce  a  positive  result.  In  more  detail,  the 
features  of  a  lesson  that  target  improvements  to  planning  tasks  are: 

Originating  action:  The  action  taken  in  the  lesson’s  initiating  experience. 

Result:  This  indicates  whether  the  experience  was  positive  or  negative,  and  helps  to 
determine  whether  to  recommend  repeating  or  avoiding  the  same  experience. 

Lesson  contribution:  This  is  the  crucial  feature  (e.g.,  a  set  of  constraints)  that 
characterizes  the  originating  action  and  is  responsible  for  the  result  of  the  original 
experience.  The  lesson’s  contribution  is  the  element  that  should  be  repeated,  in 
conjunction  with  the  originating  action,  when  the  experience  has  a  positive  result,  and 
it  should  be  avoided  when  the  result  is  negative. 

Applicable task: This is a pre-defined task in an organization's targeted planning
process. The lesson author must identify the task to which the lesson is applicable.

Conditions for reuse: These are the values of the state that, when matched closely,
will cause a lesson to be reused. Knowledge for identifying and assessing similarity
between conditions and state variables must be elicited from domain experts.

Suggestion: This is the recommended response action. It is entailed by a lesson’s
other features (i.e., a negative experience should be avoided) and provided by the
lesson author.


⁴ In decision-making systems, lessons are applicable to decisions. In planning, lessons are
applicable to tasks.

We illustrate this representation with a lesson from the Joint Unified Lessons Learned
System⁵ concerning non-combatant evacuation operations (Section 5.2). This lesson
refers to the step in which non-combatants had to be registered prior to evacuation in
a disaster relief operation after the April 1991 eruption of Mt. Pinatubo in the
Philippines. The lesson’s summary is: "The evacuee registration process was very time
consuming and contributed significantly to delays in throughput and to evacuee
discomfort under tropical conditions." Our representation for this lesson is as follows:

Originating action:  Evacuee registration
Action result:       Delays, time consuming, and evacuee discomfort → negative
Contribution:        Triple registration process is problematic
Applicable task:     Evacuee registration
Conditions:          Under tropical conditions
Suggestion:          Locate an INS (Immigration and Naturalization Service)
                     screening station at the initial evacuation processing site.
                     Evacuees are required to clear INS procedures prior to
                     reporting to the evacuation processing center.

This  lesson  refers  to  a  negative  outcome  (e.g.,  evacuee  discomfort).  The 
expression  “under  tropical  conditions”  is  a  condition  for  reuse.  In  this  lesson,  the 
applicable  task  is  the  same  as  the  originating  action,  although  this  is  not  true  for  all 
lessons.  The  lesson  recommends  an  alternative  method  of  registration  that  is  not  time 
consuming,  which  defines  its  suggestion. 
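
One way to picture this representation is as a simple record type. The sketch below is an illustrative Python encoding of the six features and of the Mt. Pinatubo lesson. It is not the ALDS implementation (which the authors report was written in Java), and the class name, the field names and the condition key "climate" are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Lesson:
    """Illustrative record holding the six lesson features described in Section 3."""

    originating_action: str
    result: str  # "positive" or "negative"
    contribution: str
    applicable_task: str
    conditions: Dict[str, str] = field(default_factory=dict)
    suggestion: str = ""

    def advice(self) -> str:
        """Repeat the contribution after a positive result, avoid it after a negative one."""
        verb = "repeat" if self.result == "positive" else "avoid"
        return f"{verb}: {self.contribution}"


# The evacuee registration lesson; the key "climate" is a hypothetical state variable name.
pinatubo = Lesson(
    originating_action="Evacuee registration",
    result="negative",
    contribution="Triple registration process is problematic",
    applicable_task="Evacuee registration",
    conditions={"climate": "tropical"},
    suggestion=(
        "Locate an INS screening station at the initial evacuation processing site."
    ),
)

if __name__ == "__main__":
    print(pinatubo.advice())
```

Keeping the applicable task and the conditions as separate, structured fields is what allows a lesson to be indexed and matched automatically rather than interpreted from free text.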

Because  a  lesson  may  still  be  applicable  even  when  its  conditions  are  not  perfectly 
matched  by  the  state,  reusing  lessons  using  a  CBR  approach  is  appropriate.  The 
similarity  assessment  between  conditions  and  state  variables  is  modeled  using  elicited 
expert knowledge. Adaptation (e.g., replacing tropical conditions with winter
conditions)  is  not  supported  because  the  user  must  decide  whether  to  apply  the 
recommended  suggestion.  The  only  feature  that  can  be  inferred  is  the  suggestion, 
from  information  embedded  in  the  originating  action,  lesson  contribution,  and  result. 
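
The paper does not give the similarity function used to compare a lesson's conditions with the plan state, so the following is only a sketch of one plausible partial-matching scheme: each condition is compared with the corresponding state variable using a lookup table standing in for elicited expert knowledge, and the scores are averaged. The table entries, the variable name "climate" and both function names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical expert-elicited similarities between non-identical values.
VALUE_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("tropical", "subtropical"): 0.8,
}


def value_similarity(a: str, b: str) -> float:
    """Similarity of two values: 1 if equal, otherwise a symmetric table lookup (default 0)."""
    if a == b:
        return 1.0
    return VALUE_SIMILARITY.get((a, b), VALUE_SIMILARITY.get((b, a), 0.0))


def condition_similarity(conditions: Dict[str, str], state: Dict[str, str]) -> float:
    """Average similarity between each lesson condition and the matching state variable.

    A condition whose variable is absent from the state contributes 0.
    """
    if not conditions:
        return 1.0
    total = 0.0
    for variable, expected in conditions.items():
        if variable in state:
            total += value_similarity(expected, state[variable])
    return total / len(conditions)


if __name__ == "__main__":
    lesson_conditions = {"climate": "tropical"}
    print(condition_similarity(lesson_conditions, {"climate": "subtropical"}))
```

A retrieval threshold on this score would then decide whether a lesson is close enough to the current state to be shown; the choice of threshold is likewise left open here.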

In  the  implementation  of  ALDS  in  HICAP,  the  applicable  planning  task  and 
conditions are used for indexing a lesson. To improve retrieval and consequently
improve reuse, an effective indexing scheme should anticipate the end users’ needs and
indexing  style  (Kolodner,  1993).  Therefore,  a  different  indexing  strategy  is  required 
to  facilitate  retrieval  of  lessons  that  target  technical  decision  making.  This  indexing 
strategy  should  use  an  expert’s  model  so  that  technicians  can  identify  the  model 
component  targeted  by  the  lesson  (instead  of  identifying  an  applicable  task)  and  other 
features  (e.g.,  the  problem,  its  causes,  and  the  symptoms  associated  with  that 
component). 

4  Lessons  Learned  Process 

In  Section  1  we  identified  two  problems  with  traditional  lessons  dissemination 
approaches:  lesson  representations  that  do  not  promote  reuse  and  standalone  retrieval 
tools.  In  Section  3  we  proposed  a  representation  that  facilitates  lesson  reuse.  This 
section  focuses  on  embedding  LL  systems  in  their  targeted  processes. 

An organization’s lesson learned process typically involves the following tasks:
collecting, validating, storing, disseminating, and reusing lessons. For example, military
organizations request their members, after completing a mission, to submit lessons to
an LL center, where they are analyzed, indexed according to a task list specific to that
branch of the armed services, validated, and stored in a repository. Lesson repositories
are provided to military personnel, and are accessible on the secure military network
SIPRNET and also on CD-ROMs. An accompanying search engine is used to submit
queries in the hope of retrieving relevant lessons. Thus, LL centers are responsible for
collecting, validating, storing, and disseminating lessons so potential users can reuse
them. These five steps summarize the standard LL process, which varies slightly
among LL centers.

⁵ https://www-secure.jwfc.acom.mil/protected/jcll.

Most systems for lesson retrieval are standalone and passive, and thus ill-suited for
promoting lesson dissemination and reuse because they require users to master a new
process (i.e., search for relevant lessons in a separate standalone LL retrieval tool)
that is independent of their problem-solving task. In fact, this process makes several
unrealistic assumptions: it assumes that a user is reminded of the potential utility of an
LL system whenever it may be useful, knows that the system exists, knows where to
find it, has the time and the skills to use it, and can correctly interpret and reuse
retrieved lessons.

We identified two desired characteristics of an LL process for facilitating
knowledge  sharing.  First,  it  must  deliver  lesson  knowledge  during  process  execution 
(e.g.,  business,  planning)  to  support  decision-making.  Second,  it  must  be  embedded  in 
the  process  targeted  by  a  lesson.  An  embedded  LL  system  should  monitor  this 
process,  identify  changes  in  the  plan  state,  recognize  when  a  lesson  is  applicable  to  the 
current  decision  or  task  (i.e.,  when  the  conditions  of  the  lesson  and  plan  state  match), 
and  proactively  highlight  relevant  lessons  to  the  user  (Figure  1).  This  process  will 
allow  a  user  to  incorporate  a  relevant  lesson’s  suggestion,  which  can  potentially 
modify  the  user’s  decision-making.  Thus,  this  active  delivery  process  promotes 
embedding  knowledge  reuse  into  the  decision-making  process. 
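
The monitoring behavior described above can be sketched as a small event-driven component: every time a task is added to the plan or a state variable changes, the monitor re-checks its lessons and returns those that have just become applicable. This is an illustrative Python sketch, not the authors' implementation; the class and method names, the exact-match threshold default, and the phrasing of the example question are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class IndexedLesson:
    applicable_task: str
    conditions: Dict[str, str]  # question -> expected answer
    suggestion: str


def matched_fraction(conditions: Dict[str, str], state: Dict[str, str]) -> float:
    """Fraction of a lesson's conditions whose answers agree with the current state."""
    if not conditions:
        return 1.0
    hits = sum(1 for q, a in conditions.items() if state.get(q) == a)
    return hits / len(conditions)


class ActiveLessonMonitor:
    """Watches plan tasks and plan state, and reports each lesson once when it triggers."""

    def __init__(self, lessons: List[IndexedLesson], threshold: float = 1.0) -> None:
        self.lessons = list(lessons)
        self.threshold = threshold
        self.plan_tasks: Set[str] = set()
        self.state: Dict[str, str] = {}
        self._delivered: Set[int] = set()  # positions of lessons already shown

    def add_task(self, task: str) -> List[IndexedLesson]:
        self.plan_tasks.add(task)
        return self._newly_triggered()

    def answer(self, question: str, answer: str) -> List[IndexedLesson]:
        self.state[question] = answer
        return self._newly_triggered()

    def _newly_triggered(self) -> List[IndexedLesson]:
        fresh = []
        for position, lesson in enumerate(self.lessons):
            if position in self._delivered:
                continue
            task_matches = lesson.applicable_task in self.plan_tasks
            if task_matches and matched_fraction(lesson.conditions, self.state) >= self.threshold:
                self._delivered.add(position)
                fresh.append(lesson)
        return fresh


# Example lesson; the question text is a hypothetical rendering of "under tropical conditions".
TROPICAL_Q = "Is the operation under tropical conditions?"
registration_lesson = IndexedLesson(
    applicable_task="Evacuee registration",
    conditions={TROPICAL_Q: "Yes"},
    suggestion="Locate an INS screening station at the initial evacuation processing site.",
)
```

In this sketch a lesson fires only when both its task and its conditions match, and it is delivered at most once, so the user is not interrupted repeatedly by the same lesson.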


Fig.  1  Proposed  lessons  learned  process. 

These  observations  motivated  us  to  design  an  active  lessons  delivery  approach,  to 
be  embedded  in  a  user’s  decision  support  tool.  By  automatically  bringing  relevant 
lessons  to  the  user’s  attention,  it  promotes  lesson  reuse  by  reducing  the  burden  on  the 
user.  In  particular,  this  process  can  clarify  how  a  lesson  is  relevant  to  the  user’s 
current  decision-making  task  by  reducing  or  eliminating  problems  of  lesson 
interpretation  and  selection,  does  not  require  the  user  to  consult  a  separate  LL  system, 
should  increase  the  precision  and  recall  of  lesson  retrieval,  and  should  allow  users  to 
automatically incorporate a triggered lesson’s suggestion into the evolving plan.


5  An  Active  Lessons  Delivery  System

An  embedded  active  lessons  delivery  module  monitors  a  decision-making  process, 
bringing  lessons  to  the  user’s  attention  when  they  become  relevant.  The  primary 
constraint  on  the  embedding  decision  support  tool  is  that  it  represents  and  maintains 
information  on  this  process  that  can  be  used  to  index  appropriate  lessons  (e.g., 
lesson’s  task  and  triggering  conditions).  We  illustrate  this  active  lessons  delivery 
approach  for  a  military  planning  process  in  the  context  of  HICAP.  The  following 
subsections  introduce  HICAP  and  detail  an  example  that  illustrates  the  use  of  ALDS. 

5.1  The  Decision  Support  Tool:  HICAP 

HICAP (Hierarchical Interactive Case-based Architecture for Planning) (Breslow et
al., 2000) helps users to formulate a hierarchical plan, which is represented as a tuple
$P = \{T, R, A\}$. $T$ is a hierarchical task network (HTN) comprising a set of tasks, a
temporal ordering relation $<$, and a parent relation: each task $t$ is
defined by its name $t_n$ and duration $t_d$, the relation $<$ defines a (partial) temporal
ordering on tasks, and the parent relation records which task is the parent of which
other task in $T$. The leaves of $T$ comprise
the primitive actions to be included in the plan. $R$, which is also represented using an
HTN, is the plan’s set of resources. Finally, $A$ is a set of assignments between the
plan’s tasks and resources. Also of interest is $S = \{\langle q, a \rangle^{+}\}$, which denotes state
information in the form of a set of <question,answer> pairs.
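
To make the plan tuple concrete, the following Python sketch models the task network, the resource hierarchy, the assignments, and the state as plain data structures. It is an illustration under stated assumptions rather than HICAP's actual data model: the class names, the root task "Perform NEO", the resource names, the durations and the ordering pair are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A node in a hierarchy; used here for both tasks (T) and resources (R)."""

    name: str
    duration: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List["Node"]:
        """Return the leaf nodes; for the task network these are the primitive actions."""
        if not self.children:
            return [self]
        found: List["Node"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


@dataclass
class Plan:
    tasks: Node                                  # T: root of the task network
    resources: Node                              # R: root of the resource hierarchy
    ordering: List[Tuple[str, str]] = field(default_factory=list)   # (earlier, later) pairs
    assignments: Dict[str, str] = field(default_factory=dict)       # A: task -> resource
    state: Dict[str, str] = field(default_factory=dict)             # S: question -> answer


# Hypothetical example plan.
plan = Plan(
    tasks=Node(
        "Perform NEO",
        children=[
            Node("Evacuee registration", duration=2.0),
            Node("Assign air wing", duration=1.0),
        ],
    ),
    resources=Node("Task force", children=[Node("Air wing")]),
    ordering=[("Evacuee registration", "Assign air wing")],
    assignments={"Assign air wing": "Air wing"},
    state={"Is it necessary to use covert SOF helicopters?": "Yes"},
)

if __name__ == "__main__":
    print([task.name for task in plan.tasks.leaves()])
```

Representing the state as a question-to-answer mapping mirrors the <question,answer> pairs in the text and makes it straightforward to compare the state with a lesson's conditions.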

HICAP’s  modules  include,  among  others,  a  Hierarchical  Task  Editor  (HTE)  that 
allows  users  to  edit  a  plan,  a  conversational  case  retriever  (NaCoDAE/HTN)  that 
allows  users  to  interactively  select  a  stored  decomposition  to  apply  to  a  task  in  T,  and 
a  generative  planner  (JSHOP)  that  can  be  selected  to  automatically  decompose  tasks 
in  T  into  subtasks.  The  plan  state  S  is  updated  by  direct  user  input,  through  user 
interactions  with  NaCoDAE/HTN,  or  by  JSHOP. 

5.2  The  Task  Domain:  Noncombatant  Evacuation  Operations  (NEOs) 

We  initially  designed  HICAP  for  deliberative  NEO  planning;  no  AI  system  has  been 
deployed  to  assist  military  experts  to  plan  NEOs.  NEOs  (DoD,  1994)  are  performed 
by  the  US  military  to  assist  in  the  evacuation  of  non-combatants,  non-essential 
military  personnel,  and  others  (e.g.,  host  nation  citizens)  whose  lives  are  in  danger 
(e.g.,  due  to  political  insurgencies,  volcanic  eruptions)  from  an  endangered  location 
(e.g.,  a  beleaguered  US  embassy)  to  an  appropriate  safe  haven. 

Each  lesson  in  HICAP  is  indexed  by  its  applicable  task  and  conditions.  For 
example,  one  such  lesson  for  the  NEO  planning  domain  is: 

Originating action: Assign conventional use of air wing
Action result: Increases the risk to detection of clandestine SOF (negative)
Contribution: Conventional (low visibility) air wing increases SOF risk
Applicable task: Assign air wing
Conditions: Q: Is it necessary to use covert SOF helicopters? A: Yes
Suggestion: Assign high visibility to conventional air wing

A lesson’s conditions are represented as <question,answer> pairs so their
similarity  with  state  variables  can  be  easily  assessed.  The  user  decides  whether  and 
how  the  lesson’s  suggestion  will  be  implemented,  as  illustrated  below. 
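
The indexing scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not HICAP's implementation: the `Lesson` class, its field names, and the exact-match test in `triggered_lessons` are assumptions made for the example (the text only says that the similarity between conditions and state variables is assessed).

```python
from dataclasses import dataclass

@dataclass
class Lesson:
    """Hypothetical lesson record, indexed by applicable task and conditions."""
    applicable_task: str
    conditions: dict   # question -> required answer
    suggestion: str

def triggered_lessons(lessons, current_task, state):
    """Return lessons whose task matches and whose conditions all hold in `state`.

    `state` plays the role of S: a mapping from questions to answers.
    Exact matching is an assumption; a real system may use partial similarity.
    """
    return [
        lesson for lesson in lessons
        if lesson.applicable_task == current_task
        and all(state.get(q) == a for q, a in lesson.conditions.items())
    ]

sof_lesson = Lesson(
    applicable_task="Assign air wing",
    conditions={"Is it necessary to use covert SOF helicopters?": "Yes"},
    suggestion="Assign high visibility to conventional air wing",
)
state = {"Is it necessary to use covert SOF helicopters?": "Yes"}
hits = triggered_lessons([sof_lesson], "Assign air wing", state)
```

In this sketch the lesson is returned only when the task matches and the question has been answered Yes; whether and how its suggestion is applied remains the user's decision.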

5.3  Active  Lessons  Delivery  Module:  An  Example 

For  this  example,  we  use  the  fictitious  Terror  in  the  Jungle  NEO  scenario,  obtained 
from  the  DISA  Adaptive  Courses  of  Action  ACTD.6  Some  tasks  in  its  task  hierarchy 
can  be  further  decomposed  using  interactive  case  retrieval.  After  the  user  selects  a 
task  to  expand,  NaCoDAE/HTN  displays  alternative  expansions  that  could  apply, 
along with questions that, if answered by the user, could help determine which case’s
conditions best match S. The task being expanded here is Rescue mission, which
concerns how to safely evacuate the evacuees.

After  answering  some  questions  and  thus  updating  the  state,  the  case  retriever  then 
displays  the  question  Is  it  necessary  to  use  covert  SOF  helicopters?  The  user  answers 
Yes,  yielding  a  perfect  match  with  a  task  decomposition  case  that  expands  to  the 
subtasks  Use  ground  support  and  Assign  conventional  use  of  air  wing. 


Fig.  2. A  lesson  pertaining  to  camouflaging  special  operations  forces. 

When  expanding  these  tasks,  ALDS  recognizes  that  a  lesson  applies  (i.e.,  the  user 
had  indicated  the  need  to  use  Special  Operations  Forces  (SOF)  helicopters  for  the 
evacuation)  and  displays  it  (Figure  2).  This  lesson,  which  is  applicable  to  the  task 
Assign conventional use of air wing, suggests replacing this task with Assign high
visibility  to  air  wing.  Figure  3  displays  the  resulting  task  hierarchy.  The  meaning  of 
this  lesson  is  that  military  protocol  dictates  that  SOF  forces  should  be  made  less 
conspicuous  whenever  they  are  deployed.  In  this  example,  a  high-visibility  air  wing, 
composed  of  conventional  forces,  will  more  easily  hide  the  SOF  forces. 


6  http://www.les.disa.mil/insert/acoa/index.htm 

Fig. 3. A subset of the task hierarchy after applying the lesson shown in Figure 2.


6  Concluding  Remarks  and  Future  Work 


In  this  paper  we  focused  on  the  reuse  of  lessons  learned.  We  identified  two  problems 
that  interfere  with  lesson  reuse:  inadequate  lesson  representations  (e.g.,  how  different 
features  should  be  highlighted  to  enable  interpretation)  and  system  architecture  (i.e., 
how  lessons  learned  systems  should  be  embedded  into  the  decision-making  process). 
We  then  proposed  an  active  lessons  delivery  approach  (ALDS)  to  address  these 
problems  and  exemplified  its  use  in  HICAP,  a  plan  authoring  tool. 

We  have  not  yet  evaluated  the  utility  of  ALDS  in  NEO  exercises,  and  instead 
developed  a  simple  travel  planning  domain  for  evaluating  the  impact  of  ALDS 
(Weber  et  al.,  2000a).  In  future  work,  we  will  examine  how  to  use  HICAP  to  guide 
interactive  lesson  elicitation,  demonstrate  the  utility  of  active  lessons  delivery  for 
other  decision  support  tasks,  and  transition  HICAP  to  the  ACOA  ACTD  project. 

Abstract. The problem addressed by Mediation to Implement Feedback
in Training (MIFT) is to customize the feedback from training
exercises by exploiting knowledge about the training scenario, training
objectives, and specific student/teacher needs. We achieve this by inserting
an intelligent mediation layer into the information flow from observations
collected during training exercises to the display and user interface.
Knowledge about training objectives, scenarios, and tasks is maintained
in the mediating layer. A design constraint is that domain experts must
be able to extend mediators by adding domain-specific knowledge that
supports additional aggregations, abstractions, and views of the results
of training exercises.

The MIFT mediation concept is intended to be integrated with existing
military training exercise management tools and to reduce the cost of
developing and maintaining separate feedback and evaluation tools for
every training simulator and every set of customer needs. The MIFT
architecture is designed as a set of independently reusable components
which interact with each other through standardized formalisms such
as the Knowledge Interchange Format (KIF) and Knowledge Query and
Manipulation Language (KQML).


1  Mediation  applied  to  military  exercise  management 

The  initial  application  of  MIFT  is  the  Exercise  Analysis  and  Feedback 
phase  of  military  exercise  management  as  schematically  shown  in  Figure 
1.  More  precisely,  the  focus  is  on  simulation-based  army  training  exercises 
[1].  MIFT  handles  some  of  the  information  flows  involved  in  training 
exercise  management.  The  intent  of  MIFT  is  to  supplement  the  flow  of 
information  from  simulations  to  evaluation  and  review  and  complete  a 
feedback  loop  by  supplying  information  to  plan  and  tailor  future  training 
exercises. 

MIFT processes the data that is logged during training exercises and
uses scenario information and domain knowledge to organize the data
from the exercises in ways that are meaningful and useful for the
Observer/Controllers (O/Cs) managing the exercises, trainees, commanders,
exercise evaluators, and others interested in the results of training
exercises. MIFT is designed to feed information to other software systems
that generate training scenarios and help commanders plan future training
exercises tailored to the needs of their trainees. The MIFT design is
intended to integrate with other exercise management applications (see
Figure 3) and achieve two key application goals for exercise feedback:

1. The software is easy to use because it uses domain-specific exercise
concepts and terminology.

2. Domain experts are able to extend feedback software and tailor it to
domain-specific and local needs.




Fig.  1.  MIFT’s  mediators  supplement  the  flow  of  information  from  simulations  to 
evaluation  and  review  and  complete  the  feedback  loop  by  supplying  information  to 
plan  and  tailor  future  training  exercises. 


MIFT will achieve the first goal by incorporating knowledge about the
scenario objectives and the tasks and subtasks to be trained. MIFT uses
this scenario knowledge to relate simulation results to the objectives and
tasks to be trained so that O/Cs, trainees, and commanders can query the
simulation results using scenario-based terminology. For example, rather
than forcing the O/C to formulate a query to “select all enemy detections
of Alpha company before an assault,” the O/C can simply ask whether
Alpha company achieved its scenario subtask of remaining hidden until
the beginning of the attack. The mediator will know that enemy detections
before the attack are evidence that the unit was not successful in
remaining hidden. In general, MIFT produces results tailored to the needs
of exercise planners, weapons designers, and tactics developers. The second
goal of a mediator-based architecture is to enable military training
and support personnel to tailor and extend analysis and feedback software
to meet their own local needs [6]. As Figure 1 illustrates, MIFT’s mediators
supplement the flow of information from simulations to evaluation and
review and complete the feedback loop by supplying information to plan
and tailor future training exercises.
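
The Alpha company query described above can be made concrete with a small Python sketch. It assumes, hypothetically, that the simulation log has already been reduced to detection events carrying a unit name and a timestamp; the `Detection` type and the function name are illustrative and are not part of MIFT.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical log record: `unit` was detected by the enemy at `time`."""
    unit: str
    time: float

def remained_hidden(unit, detections, attack_start):
    """Scenario-level query: True if no enemy detection of `unit` precedes the attack."""
    return not any(d.unit == unit and d.time < attack_start for d in detections)

log = [Detection("Alpha", 125.0), Detection("Bravo", 40.0)]
alpha_hidden = remained_hidden("Alpha", log, attack_start=100.0)  # only detected after the attack began
bravo_hidden = remained_hidden("Bravo", log, attack_start=100.0)  # detected before the attack began
```

The O/C asks the scenario-level question (`remained_hidden`), while the knowledge that early detections count as evidence of failure lives inside the function, which stands in for the mediator.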

2  Mediation  Technology 

A mediator is a software module that exploits encoded knowledge about
certain sets or subsets of data to create information for a higher layer
of applications [8] [9]. It should be small and simple, so that it can be
maintained by one expert or, at most, a small and coherent group of experts.
The first step in developing a mediation architecture for training
feedback is to isolate the mediators from lower-level data sources and
from higher-level user interface and application code [5]. This enables
mediators to serve as reusable middleware. Mediators interact with each
other through standardized knowledge exchange and communications
protocols. We have used protocols based on the Knowledge Interchange
Format (KIF) [4] and the Knowledge Query and Manipulation Language
(KQML) [3] so that mediators can work with data from multiple knowledge
sources and supply information that is reusable in multiple roles. The MIFT
mediation architecture combines plug-in components at three levels:

1. User interfaces that accept information from mediators and provide a
standard set of display options.

2. Mediators that use scenario-based knowledge to analyze, transform,
query, and present simulation results. A mediator supports numerous
modules, which are relatively small components. A module is a collection
of rules that encode domain knowledge. Domain experts extend the
analysis functionality by adding domain knowledge to mediators or by
plugging in additional modules.

3. Wrappers that connect MIFT with the output formats of operational
simulators. Currently wrappers are tailored for JANUS and SimNet/LEAF
data.
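
To illustrate the kind of message that mediators might exchange, the following Python sketch formats a KQML `tell` performative whose content is a KIF-style expression. The performative name and the `:sender`, `:receiver`, `:language`, `:ontology`, and `:content` parameters are standard KQML; the agent names, the ontology name, and the `force-ratio` content are invented for the example and do not come from MIFT.

```python
def kqml_message(performative, **params):
    """Format a KQML message; each keyword argument becomes a ':name value' parameter."""
    fields = " ".join(f":{name.replace('_', '-')} {value}" for name, value in params.items())
    return f"({performative} {fields})"

msg = kqml_message(
    "tell",
    sender="force-ratio-mediator",                # hypothetical agent name
    receiver="aar-user-interface",                # hypothetical agent name
    language="KIF",
    ontology="exercise-feedback",                 # hypothetical ontology name
    content='"(= (force-ratio blue red) 1.5)"',   # illustrative KIF expression
)
```

Because the message names its content language and ontology explicitly, a receiving component can decide whether it is able to interpret the content before parsing it.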

3  Implementation  and  Functionalities 

The current MIFT user interface is built on Web browsers, which enables
execution on multiple platforms. In other words, the MIFT user interface
can run at any location that supports Web browsing; the user does not
have to download the simulation data. An innovation of the user interface
is that it is designed to display information received from a mediator.
Users connect to MIFT and the underlying exercise results by using a
Java-capable browser. Building the user interface in a browser has several
advantages:

1. Users can access exercise results in the same way they access other
information from local and remote sources. The user interface will be
increasingly familiar to O/Cs and trainees.

2. The exercise data may be local or remote. Startup and initialization are
simple. Users do not have to download and manage the exercise data.

A key benefit of mediators for military training applications is that
they spare each simulation program from having to build from scratch
and maintain a separate set of analysis and feedback software packages.

The operations referenced by the mediator can be layered in the
direction of the data-to-knowledge aggregation as shown in Figure 2. For
example, the first two levels in the mediator perform standard aggregations,
selections, and analyses on the data sources. We have implemented
these two levels to provide a basic level of functionality for higher levels.
The third level in the mediator uses knowledge of the training scenario
so that O/Cs and trainees can obtain feedback about how well specific
scenario tasks have been performed. The mediator allows users to obtain
specific feedback without having to understand the structure of the
underlying data. A planned fourth level in the mediator will use domain-specific
models about the exercise, the scenario, and causal relationships in the
exercise to analyze the data for its probable significance and automatically
call the users’ attention to what it perceives as the more relevant
exercise results. It is useful to think of the mediator as composed of three
parts:

1. Data from disparate sources are converted into object instances over
which inferences can be performed.

2. Knowledge about the application domain is maintained in declarative
representations.

3. An inference engine processes the knowledge and data sources to produce
higher level information that is passed to other mediators or to
the user interface in a standardized form.

MIFT is intended to be used by an Observer/Controller (O/C) during
an After Action Review (AAR), or by a trainee after the AAR. Similar
MIFT functionality will be useful to commanders, exercise evaluators,
weapons designers, and others, but each of these other users is likely to
want a different user interface and additional mediator functionality.

Fig. 2. The operations envisaged from the mediator can be layered in the direction of
the data-to-knowledge aggregation.

MIFT uses wrappers to isolate the mediators from the specific data
formats and other differences between simulator outputs. When a mediator
needs additional information, it calls the appropriate wrapper. The
wrapper accesses the data and creates instances of the appropriate objects.
The current implementation includes wrappers that process the
outputs of Janus simulation runs and LEAF-formatted data from SimNet
results. We believe that MIFT functionality can be made available
for additional simulators by writing the appropriate wrapper to process
simulator outputs. Writing additional wrappers requires programming
expertise, but it is not a major undertaking. Using MIFT on a different
simulation may also require additional modules and/or user interfaces to
provide new functionality appropriate for that simulation. For example,
the mediator that creates force ratios is more useful for simulations at the
battalion or higher level and might not have been developed for analysis
of simulations at the company level.

3.1  Implementing  a  Programmable  Mediator 

The architecture of the MIFT mediator is based on a system that can
sustain a minimal first-order logic inference capability. To further
minimize development cost, parts of the mediator are written in Clips [7],
a widely available and easily portable expert system shell. With a small
amount of careful programming, Clips was able to support networking [2],
a forward or backward chainer, a unifier, an in-memory object-oriented
database, and a knowledge base that accepts and translates knowledge in
the form of objects, rules, and facts. The major function of the architecture
is to construct a temporal hyper-graph that triggers the modules attached
to it, which then perform their assigned tasks. The five major modules are:

1. A Conflict Resolver, which maintains the truth values in the system.

2. Domain Modules, the processes that apply the domain knowledge
needed for the main requirements.

3. A Report Agent, which generates reports and wraps them in KQML
once the main requirements are accomplished.

4. A Maintenance Module, which runs after some of the processes operating
on the data have terminated.

5. Data Wrappers, which perform the wrapping needed to maintain a
correct syntax for the language in use, so that template structures are
not violated.

This greatly reduces the amount of data that must be loaded relative to
the amount that will be used. Typically, most databases are collections
of instance events which have a time stamp associated with them, and
hence the wrappers are capable of playing back the databases as a function
of time. Wrappers are mostly written in C++ to suit the variety and
embedded complexity of the original databases.
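
The time-based playback mentioned above can be sketched as a generator. The sketch is in Python for brevity (the text states that the actual wrappers are mostly written in C++), and the record layout `(timestamp, description)` is an assumption made for the example.

```python
def play_back(events, start=float("-inf"), end=float("inf")):
    """Yield time-stamped events in chronological order within [start, end]."""
    for timestamp, payload in sorted(events, key=lambda e: e[0]):
        if start <= timestamp <= end:
            yield timestamp, payload

# Hypothetical raw records, deliberately out of order.
raw = [(30.0, "tank T1 destroyed"), (10.0, "Alpha detected"), (20.0, "Bravo moves")]
window = list(play_back(raw, end=20.0))
```

A generator hands events to the mediator one at a time, so only the requested time window needs to be materialized.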

The MIFT mediator is reprogrammed from task to task by changing the
domain module. Although an attempt was made to make the conflict
resolver generic across tasks, domain-specific rules are used in the module.
The major goal of the conflict resolver is to identify knowledge that might
be disruptive to the overall mediator operation. The domain expert rules
were divided into three categories:

1. Cyclic behavior: asserted events result in cyclic effects in the
process of inference.

2. Repetition and redundancy: asserted events are redundant in
the databases.

3. Constrained space: the truth value of an asserted event conflicts
with prior asserted events, for example a tank stated as destroyed that
appears later in the simulation as a functional unit. Conflicts were
generically sorted out using deduction rules which eliminate the
erroneous event.
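
Two of these checks can be sketched as follows. This is an illustration in Python rather than Clips; the event tuple layout, and the choice to treat any activity reported after a unit's destruction as the erroneous event, are assumptions made for the example.

```python
def resolve_conflicts(events):
    """Drop duplicate events and events that contradict an earlier 'destroyed' report.

    Each event is a tuple (time, unit, status); this layout is hypothetical.
    """
    accepted, seen, destroyed_at = [], set(), {}
    for event in sorted(events):
        time, unit, status = event
        if event in seen:
            continue                      # repetition and redundancy
        if unit in destroyed_at and status != "destroyed":
            continue                      # constrained space: unit was already destroyed
        seen.add(event)
        if status == "destroyed":
            destroyed_at.setdefault(unit, time)
        accepted.append(event)
    return accepted

log = [(5, "T1", "moving"), (9, "T1", "destroyed"), (9, "T1", "destroyed"), (12, "T1", "firing")]
clean = resolve_conflicts(log)
```

Here the duplicate destruction report and the later "firing" report for tank T1 are both discarded, leaving a consistent event history.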

4  Conclusion 

This paper describes Mediation to Implement Feedback in Training (MIFT),
which customizes the feedback from training exercises by exploiting
knowledge about the training scenario, training objectives, and specific
student/teacher needs. We plan to achieve this by inserting intelligent
mediators into the information flow from observations collected during
training exercises to the display and user interface functionality.
Knowledge about training objectives, scenarios, and tasks is maintained
in the mediators. A technical constraint is that domain experts must be
able to extend mediators by adding domain-specific knowledge that
supports additional aggregations, abstractions, and views of the results
of training exercises.

MIFT is intended to allow analysis and evaluation software to be
reused by all of the different consumers of simulation results. In addition
to trainees, O/Cs, and commanders, others who need to analyze and
evaluate simulation results include exercise planners, training managers,
weapons designers, tactics developers, and doctrine writers. MIFT can
also provide results to other software applications; for example, software
used to assist in exercise planning and preparation can use MIFT analyses
of previous exercises to identify the tasks and subtasks that need to be
emphasized in additional training. Thus MIFT contributes to completing
the feedback loop from the results of one simulation run into the planning
and preparation for future training.

The  Mediator  is  currently  written  in  Clips  6.0  [7],  a  widely-available 
and  easily  portable  expert  system  shell.  Since  user  interface  functions 
and  data  access  functions  are  separated  out  into  other  components,  the 
module  implementations  are  quite  small.  For  example,  the  force  ratio 
computation  for  any  set  and/or  combination  of  units  is  only  four  rules  for 
a  total  of  12  lines.  Most  other  mediators  at  the  current  stage  are  smaller. 
We  believe  that  some  domain  experts  will  be  able  to  write  modules  in 
Clips. 
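
For comparison, a force ratio computation is equally short in a general-purpose language. The Python sketch below is not the Clips module described above; it assumes that a force ratio is the total strength of one set of units divided by that of another, and the unit names and strength values are invented.

```python
def force_ratio(strengths, side_a, side_b):
    """Ratio of the summed strength of units in side_a to that of units in side_b."""
    total_a = sum(strengths[u] for u in side_a)
    total_b = sum(strengths[u] for u in side_b)
    if total_b == 0:
        raise ValueError("side_b has no strength; ratio undefined")
    return total_a / total_b

strengths = {"Alpha": 12, "Bravo": 9, "Red-1": 7}   # hypothetical unit strengths
ratio = force_ratio(strengths, ["Alpha", "Bravo"], ["Red-1"])
```

Taking the two sides as arbitrary lists of unit names mirrors the idea of computing the ratio for any set or combination of units.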

Fig. 3. The application of Mediation to Implement Feedback in Training (MIFT) is the
Exercise Analysis and Feedback phase of military exercise management. This figure
illustrates the many different simulation results and the roles that MIFT can play by
implementing reusable mediators that aggregate, summarize, and analyze simulation
results and deliver them to various consumers in terms tailored to their individual
needs.

Abstract. A top-down query processing method for first order deductive
databases under the disjunctive well-founded semantics (DWFS) is
presented. The method is based upon a characterisation of the DWFS
in terms of the Gelfond-Lifschitz transformation, and employs a
hyperresolution-like operator and quasi cyclic trees to handle minimal model
processing. The method is correct and complete, and can be guaranteed
to terminate given certain mild constraints on the format of database
rules. The efficiency of the method may be enhanced by the application
of partial compilation, subgoal re-ordering, and further constraints
on the format of database rules. For finite propositional databases the
method runs in polynomial space.


1  Introduction 

Over the last few years there has been a great deal of interest in the study of
semantics for deductive databases and logic programs [5], one of the most prominent
to emerge being the disjunctive well-founded semantics (DWFS). This was
introduced by Brass and Dix [1-4] as the weakest semantics which satisfies certain
desirable properties, including the generalised principle of partial evaluation.
In [2,3] an extension of the Gelfond-Lifschitz transformation was employed to
give a bottom-up characterisation of DWFS. This was then used in the DISLOP
project [1] to develop a bottom-up method of computing DWFS, using the
(bottom-up) methods of [12] to handle minimal model reasoning.

In  [10]  we  presented  a  characterisation  of  the  DWFS  directly  in  terms  of 
the  Gelfond-Lifschitz  transformation,  and  using  this  derived  a  top-down  method 
of  testing  DWFS  membership  in  propositional  countable  logic  programs.  In  this 
paper  we  extend  these  techniques  to  provide  a  top-down  query  processing  method 
for  first  order  deductive  databases  under  the  DWFS.  Our  method  is  correct  and 
complete,  and  can  be  guaranteed  to  terminate  given  certain  mild  constraints  on 
the  format  of  database  rules.  We  also  consider  how  our  method  can  be  made 
more  efficient  by  the  application  of  partial  compilation,  subgoal  re-ordering  and 
further  restrictions  on  database  rule  format. 

In  Section  3  we  restate  the  bottom-up  characterisation  of  DWFS  given  in  [2,3] 
and  its  re-characterisation  in  terms  of  the  Gelfond-Lifschitz  transformation  [10]. 
In  Section  4  we  (re)introduce  the  concept  of  a  deduction  tree  [7,8]  which  is  based 
on  a  hyperresolution-like  operator,  and  facilitates  top-down  query  processing  in 
positive  databases.  In  Section  5  we  (re)introduce  quasi  cyclic  trees  [10],  which 
are variants of cyclic trees [6-9], and enable us to perform top-down minimal
model  reasoning  in  databases  resulting  from  applications  of  the  Gelfond-Lifschitz 
transformation.  In  Section  6  we  combine  deduction  and  quasi  cyclic  trees  to  form 
our  top-down  method,  which  is  presented  by  means  of  an  example.  Sections  7  and 
8  then  examine  the  construction  (traversal)  of  deduction  and  quasi  cyclic  trees, 
and  the  related  termination  and  efficiency  issues.  Finally  Section  9  contains  our 
conclusions  and  suggestions  for  further  research. 

2  Terminology 

Throughout, $\mathcal{L} = \{P_1, P_2, \ldots, P_n, c_1, \ldots, c_m\}$ denotes a finite function free
first order language. We assume that $\{P_1, P_2, \ldots, P_n\}$ is the disjoint union of
$EXT(\mathcal{L})$ (the extensional predicates) and $INT(\mathcal{L})$ (the intensional predicates).
A positive (negative) atom is a formula of the form $P(x)$ ($\neg P(x)$), and informally
we write $P(x) \in EXT(\mathcal{L})$ when $P \in EXT(\mathcal{L})$, etc. $\mathcal{H}$ denotes the set of positive
ground atoms, i.e., the Herbrand base. If $I$ is a set of atoms, then
$\tilde{I} = \{\neg K : K \in I\}$. If $\theta$ is a first order formula, then $VAR(\theta)$ denotes the
variables in $\theta$.

A deductive database $T$ is a finite set of rules $C$ of the form
$A_1 \wedge A_2 \wedge \ldots \wedge A_n \wedge \neg A_{n+1} \wedge \neg A_{n+2} \wedge \ldots \wedge \neg A_{n+h} \rightarrow B_1 \vee B_2 \vee \ldots \vee B_r$,
where each $A_i$, $B_j$ is a positive atom and $r > 0$. $antec(C) = \{A_1, A_2, \ldots, A_n\}$,
$conseq(C) = \{B_1, B_2, \ldots, B_r\}$ and $\mathcal{N}(C) = \{A_{n+1}, A_{n+2}, \ldots, A_{n+h}\}$. We make the
standard assumptions that $VAR(conseq(C) \cup \mathcal{N}(C)) \subseteq VAR(antec(C))$, and that if
$\{B_1, B_2, \ldots, B_r\} \cap EXT(\mathcal{L}) \neq \emptyset$, then $\{B_1, B_2, \ldots, B_r\} \subseteq EXT(\mathcal{L})$ and
$antec(C) \cup \mathcal{N}(C) = \emptyset$.

$T$ is regarded as representing the set of its ground instances, which we denote
by $gr(T)$. $T$ is positive iff $\mathcal{N}(C) = \emptyset$ for each $C \in T$.

As in [6-8] we assume the existence of a set of semi-definite predicates
$SD(\mathcal{L}) \subseteq INT(\mathcal{L})$ such that for each rule $C$, if $P \in SD(\mathcal{L})$ appears in $conseq(C)$,
then $C$ is definite (i.e., $|conseq(C)| = 1$), and each predicate appearing in the body
of $C$ is in $EXT(\mathcal{L}) \cup SD(\mathcal{L})$. $SD(T) = \{C \in T : conseq(C) = \{B\}, B \in SD(\mathcal{L})\}$,
$EXT(T) = \{C \in T : conseq(C) \subseteq EXT(\mathcal{L})\}$, and $INT(T) = T - EXT(T)$. Notice
that $EXT(T)$ consists of disjuncts of ground positive extensional atoms.


3  The  disjunctive  well-founded  semantics 


Definition 3.1. If $C$ is a ground rule, then $pos(C) = \bigwedge antec(C) \rightarrow \bigvee conseq(C)$.
If $T$ is ground, and $N \subseteq \mathcal{H}$, then let $T|_g N$ denote the Gelfond-Lifschitz
transformation, $T|_g N = \{pos(C) : C \in T, \mathcal{N}(C) \cap N = \emptyset\}$.
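
Definition 3.1 translates directly into code. In the Python sketch below a ground rule is a triple of frozensets `(antec, neg, conseq)`, where `neg` stands for the set of negated body atoms of the rule; this encoding and the two-rule example database are assumptions made for illustration.

```python
def gl_transform(rules, n):
    """Gelfond-Lifschitz transformation of a ground database with respect to n.

    Keeps pos(C) for every rule C whose negated atoms are disjoint from n.
    Each input rule is (antec, neg, conseq); each result is (antec, conseq).
    """
    n = frozenset(n)
    return [(antec, conseq) for antec, neg, conseq in rules if not (neg & n)]

fs = frozenset
rules = [
    (fs(), fs(), fs({"a", "b"})),        # -> a v b
    (fs({"a"}), fs({"c"}), fs({"d"})),   # a ^ not c -> d
]
reduct_with_c = gl_transform(rules, {"c"})      # second rule is dropped
reduct_without_c = gl_transform(rules, {"a"})   # both rules kept, negation stripped
```

The result is always a positive database, which is what allows the later sections to apply techniques for positive databases to it.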

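Theorem 3.2 [2,3]. Let $D_0 = \emptyset$ and $D_{\alpha+1} = D^+_{\alpha+1} \cup D^-_{\alpha+1}$, where

$D^+_{\alpha+1} = \{\bigvee V : V \subseteq \mathcal{H},\ (gr(T)/D_\alpha)|_g \mathcal{H} \models \bigvee V\}$,

$D^-_{\alpha+1} = \{\neg Q : Q \in \mathcal{H},\ (\forall N \subseteq \mathcal{H})(N \models D^+_{\alpha+1} \Rightarrow (gr(T)/D_\alpha)|_g N \models_{min} \neg Q)\}$,

and $gr(T)/D_\alpha$ is formed from $gr(T)$ by (i) removing any rule $C$ for which
$D^+_\alpha \models \bigvee \mathcal{N}(C)$, and (ii) for each remaining rule $C$, replacing $\mathcal{N}(C)$ by
$\mathcal{N}(C) - \widetilde{D^-_\alpha}$. Then $D_0 \subseteq D_1 \subseteq D_2 \subseteq \ldots$ grows monotonically and
$DWFS = \bigcup_\alpha D_\alpha$.

Recall that $\widetilde{D^-_\alpha} = \{K : \neg K \in D^-_\alpha\}$. $T \models_{min} \neg Q$ iff $Q$ is false in all minimal
models of $T$. Note that no clause for limit ordinals is needed since $\mathcal{H}$ is finite.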
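
For small ground positive databases the minimal-model entailment relation used above can be checked by brute force. The Python sketch below enumerates every subset of the Herbrand base, so it is exponential and only meant to make the definition concrete; rules are encoded as `(antec, conseq)` pairs of frozensets, which is an assumption of the example.

```python
from itertools import combinations

def models(rules, base):
    """All subsets of base that satisfy every positive rule (antec, conseq)."""
    atoms = sorted(base)
    subsets = (frozenset(c) for r in range(len(atoms) + 1) for c in combinations(atoms, r))
    return [m for m in subsets
            if all(not antec <= m or conseq & m for antec, conseq in rules)]

def minimal_models(rules, base):
    """Models that have no proper subset which is also a model."""
    ms = models(rules, base)
    return [m for m in ms if not any(other < m for other in ms)]

def min_entails_not(rules, base, q):
    """True iff q is false in every minimal model of the database."""
    return all(q not in m for m in minimal_models(rules, base))

fs = frozenset
positive_db = [(fs(), fs({"a", "b"})), (fs({"a"}), fs({"d"}))]   # -> a v b ;  a -> d
base = {"a", "b", "c", "d"}
mins = set(minimal_models(positive_db, base))
```

In this example the minimal models are {b} and {a, d}, so the negation of c is minimally entailed while the negation of d is not.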

The construction of DWFS given above is based upon computing a set of
disjuncts and negative atoms, using these to reduce the database with the
$/$-operator, and then repeating the process. This is ideal for a bottom-up
computation, but for a top-down approach it is essential to express each of the sets
$D_{\alpha+1}$ directly in terms of $gr(T)$ rather than $gr(T)/D_\alpha$.

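Theorem 3.3 [10]. $D^+_{\alpha+1} = \{\bigvee V : V \subseteq \mathcal{H},\ gr(T)|_g(\mathcal{H} - \widetilde{D^-_\alpha}) \models \bigvee V\}$, and
$D^-_{\alpha+1} = \{\neg Q : Q \in \mathcal{H},\ (\forall N \subseteq \mathcal{H} - \widetilde{D^-_\alpha})(N \models D^+_{\alpha+1} \Rightarrow gr(T)|_g N \models_{min} \neg Q)\}$.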
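
Theorem 3.3 can be read as an iteration that tracks only the set of atoms already known to be false. The Python sketch below follows that reading by brute force over all subsets of a tiny Herbrand base. It is exponential and is not the top-down method of this paper; the rule encoding `(antec, neg, conseq)` as frozensets and the example database are assumptions.

```python
from itertools import combinations

def subsets(atoms):
    atoms = sorted(atoms)
    return [frozenset(c) for r in range(len(atoms) + 1) for c in combinations(atoms, r)]

def reduct(rules, n):
    """Gelfond-Lifschitz transformation for ground rules (antec, neg, conseq)."""
    return [(a, c) for a, neg, c in rules if not (neg & n)]

def models(pos_rules, base):
    return [m for m in subsets(base) if all(not a <= m or c & m for a, c in pos_rules)]

def minimal(ms):
    return [m for m in ms if not any(o < m for o in ms)]

def dwfs_step(rules, base, false_atoms):
    """One step: false_atoms holds the atoms K whose negation was derived so far."""
    rest = frozenset(base) - false_atoms
    ms = models(reduct(rules, rest), base)
    # Disjunctions (as atom sets V) true in every model of the reduct.
    d_plus = [v for v in subsets(base) if v and all(v & m for m in ms)]
    # Candidate sets N that satisfy every derived disjunction.
    candidates = [n for n in subsets(rest) if all(v & n for v in d_plus)]
    d_minus = frozenset(
        q for q in base
        if all(q not in m for n in candidates
               for m in minimal(models(reduct(rules, n), base)))
    )
    return d_plus, d_minus

def dwfs(rules, base):
    """Iterate dwfs_step until the set of false atoms stops changing."""
    false_atoms = frozenset()
    while True:
        d_plus, d_minus = dwfs_step(rules, base, false_atoms)
        if d_minus == false_atoms:
            return d_plus, d_minus
        false_atoms = d_minus

fs = frozenset
rules = [(fs(), fs(), fs({"a", "b"})), (fs({"a"}), fs({"c"}), fs({"d"}))]  # -> a v b ; a ^ not c -> d
d_plus, d_minus = dwfs(rules, {"a", "b", "c", "d"})
```

On this database the first step derives that c is false, after which the second rule contributes and the disjunction of b and d becomes derivable.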


4  Computation  in  positive  databases 

In Theorem 3.3 we saw that the computation of $D^+_{\alpha+1}$ employs the positive
database $gr(T)|_g(\mathcal{H} - \widetilde{D^-_\alpha})$. Derivability in positive databases can be
characterised using deduction trees [7,8,10] which in turn are based upon a
hyperresolution-like operator akin to that employed in SLO-resolution [13].

Definition 4.1. Suppose that $T$ is positive and that $V$ is a set of positive atoms.
A deduction tree for $V$ in $T$ is a finite tree containing predicate nodes and rule
nodes satisfying the following conditions.

(i) The root node (at the top of the tree) is a predicate node labelled with $V$.
All other predicate nodes are labelled with a positive atom.

(ii) If $N$ is a predicate node, then let $ACT(N)$ denote the set of atoms labelling
predicate nodes at or above $N$ (on the current branch). $N$ has a single child
node which is a rule node labelled with an instance $C\theta$ of a rule $C \in T$
(written $RN_{C\theta}$) such that $conseq(C\theta) \subseteq ACT(N)$. For each $K \in antec(C\theta)$,
$RN_{C\theta}$ has a (predicate) child node labelled with $K$.

(iii) Each leaf node is a rule node.

An instance of a deduction tree $\mathcal{T}$ is formed by applying some substitution to
all labels (including rules) within the tree. Clearly if $\mathcal{T}$ is a deduction tree in $T$,
then so is any instance of $\mathcal{T}$. A predicate node $N$ is redundant iff there is a
predicate node $N' > N$ such that $lab(N)$ equals, or is contained in, $lab(N')$ [7,8].

Theorem 4.2 [7,8]. If $T$ is positive and $V \subseteq \mathcal{H}$, then $T \models \bigvee V$ iff $V$ has a
deduction tree in $\mathrm{gr}(T)$ in which

(i) no predicate node is redundant, and

(ii) if $N$ is a predicate node with $\mathrm{lab}(N) \in \mathrm{SD}(\mathcal{L})$ and $RN_{C\theta}$ is the child node
of $N$, then $\mathrm{conseq}(C\theta) = \{\mathrm{lab}(N)\}$.

Thus we see that such trees may be constructed using ancestor pruning
(condition (i)) and that semi-definite atoms are expanded in a linear fashion. The
construction of deduction trees is discussed in Section 7.


5  Quasi  cyclic  trees 

In [6-8] we introduced the notion of a cyclic tree as a means of testing
minimal model membership. The following variant of this notion (adapted from
[10]) enables us to characterise minimal models of databases of the form $T\backslash_g N$.
Motivation for the following definition appears in [6-10].

Definition 5.1. A quasi cyclic tree $\mathcal{T}$ for $P(t)$ in $T$ is a finite tree containing
predicate nodes and rule nodes satisfying the following conditions.

(a) The root node (at the top of $\mathcal{T}$) is a predicate node labelled with $P(t)$.

(b) Each predicate node $N$ is labelled with a positive atom, denoted by $\mathrm{lab}(N)$,
and $\mathrm{CYC}(N) = \{\mathrm{lab}(N') : N'$ is a predicate node, $N' \geq N$, and $\exists N'' \geq N'$,
$N''$ is a predicate node, $\mathrm{lab}(N'') = \mathrm{lab}(N)\}$. Let $\mathrm{Pred}(\mathcal{T}) = \{\mathrm{lab}(N) : N$
is a predicate node in $\mathcal{T}\}$.

(c) A predicate node $N$ has at most a single child node which (if it exists) is a
rule node labelled with an instance $C\theta$ of a rule $C \in T$ (written $RN_{C\theta}$) such
that $\mathrm{conseq}(C\theta) \cap \mathrm{CYC}(N) \neq \emptyset$, $\mathrm{antec}(C\theta) \cap \mathrm{CYC}(N) = \emptyset$, and
$\mathcal{O}(RN_{C\theta}) = (\mathrm{conseq}(C\theta) - \mathrm{CYC}(N))$ is disjoint from $\mathrm{Pred}(\mathcal{T})$. For each
$K \in \mathrm{antec}(C\theta)$, $RN_{C\theta}$ has a (predicate) child node labelled with $K$.

(d) If $N$ is a predicate node with $\mathrm{lab}(N) \in \mathrm{SD}(\mathcal{L})$, then $N$ is not redundant.

We define $\mathcal{N}(\mathcal{T}) = \bigcup\{\mathcal{N}(C\theta) : RN_{C\theta}$ is a rule node in $\mathcal{T}\}$, and
$\mathcal{O}(\mathcal{T}) = \bigcup\{\mathcal{O}(RN_{C\theta}) : RN_{C\theta}$ is a rule node in $\mathcal{T}\}$. As in [6-10], an unfactored quasi
cyclic (UQC) tree is a quasi cyclic tree in which each leaf node is a rule node.

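Notice that condition (c) is inherently top-down. Also note that if $\mathrm{lab}(N) \in
\mathrm{EXT}(\mathcal{L}) \cup \mathrm{SD}(\mathcal{L})$, then $\mathrm{CYC}(N) = \{\mathrm{lab}(N)\}$. Let $RN_{C\theta}$ be the child node
of $N$. If $\mathrm{lab}(N) \in \mathrm{SD}(\mathcal{L})$, then $\mathrm{conseq}(C\theta) = \{\mathrm{lab}(N)\}$ and $C \in \mathrm{SD}(T)$. If
$\mathrm{lab}(N) \in \mathrm{EXT}(\mathcal{L})$, then $\theta = \emptyset$, $\mathrm{lab}(N) \in \mathrm{conseq}(C)$ and $C \in \mathrm{EXT}(T)$.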
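
As an illustration of condition (b) of Definition 5.1, the following minimal sketch computes CYC(N) for the last node of a single branch. It assumes that the node order in the definition is read as "at or above", and that a branch is given simply as the list of its predicate-node labels from the root down; the function name `cyc` and the string encoding of atoms are hypothetical and not part of the original construction.

```python
def cyc(branch):
    """Return CYC(N) for the last node N of `branch`.

    `branch` lists the labels of the predicate nodes from the root down
    to N. CYC(N) collects every label lying at or below the topmost
    node that carries the same label as N (N itself included).
    """
    label = branch[-1]
    top = branch.index(label)  # topmost occurrence of lab(N) on the branch
    return set(branch[top:])


# N repeats an earlier label: the whole cycle between the two is collected.
print(cyc(["P(a)", "Q(a)", "R(a)", "Q(a)"]))
# N carries a fresh label: CYC(N) is just {lab(N)}.
print(cyc(["P(a)", "Q(a)"]))
```

When lab(N) does not occur higher up the branch, the sketch returns the singleton {lab(N)}, in line with the remark above for extensional and semi-definite atoms.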

The  following  theorem  details  the  basic  properties  of  UQC  trees. 

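Theorem 5.2 [6,7,10]. Let $N \subseteq \mathcal{H}$.

(a) If $\mathcal{T}$ is a UQC tree, then all labels in $\mathcal{T}$ are ground.

(b) Suppose that $M$ is a minimal model of $\mathrm{gr}(T)\backslash_g N$ with $P(a) \in M$. Then we
may find a UQC tree $\mathcal{T}$ for $P(a)$ in $T$ such that $\mathrm{Pred}(\mathcal{T}) \subseteq M \subseteq \mathcal{H} - \mathcal{O}(\mathcal{T})$
and $\mathcal{N}(\mathcal{T}) \cap N = \emptyset$.

(c) Suppose that $\mathcal{T}$ is a UQC tree in $T$, $\mathcal{N}(\mathcal{T}) \cap N = \emptyset$, and $M \models \mathrm{gr}(T)\backslash_g N$
with $M \cap \mathcal{O}(\mathcal{T}) = \emptyset$. Then $\mathrm{Pred}(\mathcal{T}) \subseteq M$.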

In order to perform top-down testing of membership in $D^-_{\alpha+1}$ we will need
the following characterisation.
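
Theorem 5.3 [10]. $\neg Q \in D^-_{\alpha+1}$ iff for each UQC tree $\mathcal{T}$ for $Q$ in $T$, either
$\bigvee \mathcal{N}(\mathcal{T}) \in D^+_{\alpha+1}$ or $\mathrm{gr}(T)\backslash_g(\mathcal{H} - \mathcal{N}(\mathcal{T}) \cup D^-_\alpha) \models \bigvee \mathcal{O}(\mathcal{T})$.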

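Note that the two clauses on the right hand side of Theorem 5.3 are actually
quite similar, since $\bigvee \mathcal{N}(\mathcal{T}) \in D^+_{\alpha+1}$ iff $\mathrm{gr}(T)\backslash_g(\mathcal{H} - D^-_\alpha) \models \bigvee \mathcal{N}(\mathcal{T})$.

Note also that if $\mathcal{T}_1, \mathcal{T}_2$ are UQC trees for $Q$ with $\mathcal{O}(\mathcal{T}_1) \subseteq \mathcal{O}(\mathcal{T}_2)$ and
$\mathcal{N}(\mathcal{T}_1) \subseteq \mathcal{N}(\mathcal{T}_2)$, then $\mathcal{T}_2$ is redundant as far as Theorem 5.3 is concerned, since
$\bigvee \mathcal{N}(\mathcal{T}_1) \in D^+_{\alpha+1}$ implies $\bigvee \mathcal{N}(\mathcal{T}_2) \in D^+_{\alpha+1}$, and
$\mathrm{gr}(T)\backslash_g(\mathcal{H} - \mathcal{N}(\mathcal{T}_1) \cup D^-_\alpha) \models \bigvee \mathcal{O}(\mathcal{T}_1)$ implies
$\mathrm{gr}(T)\backslash_g(\mathcal{H} - \mathcal{N}(\mathcal{T}_2) \cup D^-_\alpha) \models \bigvee \mathcal{O}(\mathcal{T}_2)$.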

6  Top-down  query  processing 

We  can  now  combine  the  results  of  previous  sections  to  develop  a  top-down 
method  of  query  processing  under  DWFS.  We  illustrate  the  method  by  means 
of  an  example. 

Example  6.1.  Let  T  consist  of  the  following  rules. 

1. $\neg P(y) \wedge S(y,z) \wedge Q(z,y) \rightarrow Q(a,y) \vee Q(y,b)$
2. $\neg F(w,w) \wedge F(w,x) \wedge F(x,y) \rightarrow S(x,y)$
3. $E(a,b)$
4. $E(a,a)$
5. $\neg P(x) \wedge Q(x,x) \rightarrow R(x)$
6. $F(a,b)$
7. $F(b,a)$
8. $D(a,b)$
9. $D(b,a)$
10. $\neg E(y,x) \wedge D(y,x) \rightarrow R(x)$
11. $\neg E(x,y) \wedge D(y,x) \rightarrow P(y) \vee R(y)$
12. $Q(a,b) \vee Q(b,a)$

where $x, y, z, w$ are variables and $a, b$ are constants.

Suppose that we wish to test whether $Q(a,a) \vee Q(a,b)$ is in DWFS. In order
to show that $Q(a,a) \vee Q(a,b) \in D^+_{\alpha+1}$, we need to develop a deduction tree for
$\{Q(a,a), Q(a,b)\}$ in $\mathrm{gr}(T)\backslash_g(\mathcal{H} - D^-_\alpha)$. The only rule whose head unifies with a
subset of $\{Q(a,a), Q(a,b)\}$ is rule 1, with the unifier $\{y \rightarrow a\}$, thus resulting in
the partial deduction tree $\mathcal{T}_1$ in Figure 6.1(i).
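
The search for an applicable rule in this first step can be sketched as follows. This is only an illustration under simplifying assumptions: the atoms on the branch are ground (as they are for the root here), so one-way matching suffices in place of full unification; atoms are encoded as `(predicate, arguments)` tuples; and the names `VARIABLES`, `match` and `head_matches` are hypothetical.

```python
VARIABLES = {"w", "x", "y", "z"}  # variable names used in Example 6.1


def match(pattern, ground, subst):
    """Extend `subst` so that `pattern` equals the ground atom, else None."""
    pred, args = pattern
    gpred, gargs = ground
    if pred != gpred or len(args) != len(gargs):
        return None
    s = dict(subst)
    for a, g in zip(args, gargs):
        if a in VARIABLES:
            if s.setdefault(a, g) != g:  # variable already bound elsewhere
                return None
        elif a != g:                     # constant clash
            return None
    return s


def head_matches(head, act, subst=None):
    """Yield every substitution placing all head atoms inside `act`."""
    subst = {} if subst is None else subst
    if not head:
        yield subst
        return
    first, rest = head[0], head[1:]
    for g in act:
        s = match(first, g, subst)
        if s is not None:
            yield from head_matches(rest, act, s)


act = {("Q", ("a", "a")), ("Q", ("a", "b"))}
rule1_head = [("Q", ("a", "y")), ("Q", ("y", "b"))]
rule12_head = [("Q", ("a", "b")), ("Q", ("b", "a"))]
print(list(head_matches(rule1_head, act)))   # rule 1 applies with y bound to a
print(list(head_matches(rule12_head, act)))  # rule 12 does not apply yet
```

Rule 12 only becomes applicable later in the example, once $Q(b,a)$ has joined the atoms on the branch.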


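[Figure 6.1(i): the partial deduction tree $\mathcal{T}_1$ (root $\{Q(a,a), Q(a,b)\}$, rule node $RN_1$, subgoals $S(a,z)$ and $Q(z,a)$) and the UQC tree $\mathcal{T}_2$ (root $P(a)$, rule node $RN_{11}$, child $D(a,b)$, rule node $RN_8$).]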

Thus if $\neg P(a) \in D^-_\alpha$, then $Q(a,a) \vee Q(a,b) \in D^+_{\alpha+1}$ iff $S(a,z) \vee Q(a,a) \vee
Q(a,b) \in D^+_{\alpha+1}$ and $Q(z,a) \vee Q(a,a) \vee Q(a,b) \in D^+_{\alpha+1}$ (for some $z$). Note however
that if $\neg P(a) \notin D^-_\alpha$, then the fact that $S(a,z) \vee Q(a,a) \vee Q(a,b) \in D^+_{\alpha+1}$ and
$Q(z,a) \vee Q(a,a) \vee Q(a,b) \in D^+_{\alpha+1}$ (for some $z$) is of no help whatsoever. We
thus attack the negative subgoal $\neg P(a)$ first.

The only UQC tree for $P(a)$ is depicted as $\mathcal{T}_2$ in Figure 6.1(i), with $\mathcal{O}(\mathcal{T}_2) =
\{R(a)\}$ and $\mathcal{N}(\mathcal{T}_2) = \{E(b,a)\}$. By Theorem 5.3 we are thus required to show
that either $E(b,a) \in D^+_\alpha$ or that $\mathrm{gr}(T)\backslash_g(\mathcal{H} - \{E(b,a)\} \cup D^-_{\alpha-1}) \models R(a)$. In
general we therefore need to pursue one of these subgoals, returning to the other
if the first fails. In this particular case it is evident that $E(b,a) \notin D^+_\alpha$, since no
rule has a consequent that unifies with $E(b,a)$.

We thus set about trying to show that $\mathrm{gr}(T)\backslash_g(\mathcal{H} - \{E(b,a)\} \cup D^-_{\alpha-1}) \models R(a)$.
This resembles our original problem, and is depicted in Figure 6.1(ii) via the node
$(\{R(a)\} : \{E(b,a)\})$. The “rule node” $RN_{\mathcal{T}_2}$ simply indicates the computation of
the UQC tree.


[Figure 6.1(ii): node labels $Q(z,a)$, $F(a,z)$]


We thus look for a rule whose head unifies with $R(a)$. Notice that rule 5
should not be applied since it would introduce a duplicate subgoal ($\neg P(a)$) which
leads to the circular argument: $\neg P(a) \in D^-_\alpha$ if $\neg P(a) \in D^-_{\alpha-1}$. Hence the only
applicable rule is rule 10 yielding the two subgoals $\neg E(y,a)$ and $D(y,a)$. For the
former, we need to look for an instance of $E(y,a)$ which is in $\{E(b,a)\} \cup D^-_{\alpha-1}$.
In this case we set $y$ equal to $b$ (Figure 6.1(ii)), whence $D(b,a)$ is solved by the
application of rule 9. Thus $\neg P(a) \in D^-_\alpha$.

Returning to the subgoal $S(a,z)$, we apply rule 2 to yield three child nodes,
$\neg F(w,w)$, $F(w,a)$ and $F(a,z)$. Again for the first of these we are required to
generate UQC trees for instances of $F(w,w)$ in order to find an instance of
$\neg F(w,w)$ that is contained in $D^-_\alpha$. Since $F(a,a)$ and $F(b,b)$ have no UQC trees,
it is trivially the case that both $\neg F(a,a)$ and $\neg F(b,b)$ belong to $D^-_1$. Setting $w$
equal to $a$ would not allow the next subgoal to be solved, thus we set $w$ equal
to $b$. $F(a,z)$ is then solved via $\{z \rightarrow b\}$ and rule 6. Thus $S(a,b) \in D^+_{\alpha+1}$.

Finally rule 12 shows that $Q(a,a) \vee Q(a,b) \vee Q(b,a) \in D^+_{\alpha+1}$, whence
$Q(a,a) \vee Q(a,b) \in D^+_{\alpha+1}$.

Notes. Our method thus employs a combination of deduction tree and UQC
tree constructions, and can be viewed as lifting the methods of [10] to the first
order level. At each stage, if the current leaf node is a positive atom (or a set
thereof), then we extend the current branch by applying some database rule
(unifying the consequent with some of the atoms on the branch). If the current
leaf node is a negative atom $\neg Q(x)$, then we may compute UQC trees in order
to (try to) find an instance of $\neg Q(x)$ in the relevant $D^-_\beta$. As with $E(y,a)$, if the
most recent UQC tree gave rise to a goal of the form $(V : \mathcal{Q})$, then we have a
second option, which is to try to find an instance of $Q(x)$ in $\mathcal{Q}$.

Each  UQC  tree  yields  a  choice  of  two  subgoals,  the  processing  of  which  is 
independent  of  the  other  parts  of  the  tree,  with  the  exception  that  we  may  need 
to  examine  earlier  nodes  on  the  current  branch  to  prevent  duplicates. 

The  correctness  and  completeness  of  our  method  follows  from  Theorems  3.3, 
4.2,  5.3  and  the  correctness  and  completeness  of  our  constructions  (below)  for 
deduction  and  UQC  trees.  In  addition,  if  these  tree  constructions  terminate,  then 
the  method  as  a  whole  terminates,  since  the  tree  developed  passes  down  the  Da 
hierarchy  (in  particular  disallowing  duplicate  negative  subgoals). 

7  Constructing  deduction  trees 

Constructing deduction trees at the ground level is trivial: at each stage we
extend the current predicate leaf node $N$ by a rule $C \in T$ such that $\mathrm{conseq}(C) \subseteq
\mathrm{ACT}(N)$ and $\mathrm{antec}(C) \cap \mathrm{ACT}(N) = \emptyset$. This process must terminate since
duplicate atoms do not appear along any branch. Testing derivability in positive
databases can be achieved “branch at a time” [7,8], and therefore operates in
space which is linear in $|\mathcal{H}|$.
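
The ground-level construction just described can be sketched as a short recursive test. This is a minimal illustration rather than the construction of [7,8]: it assumes a ground positive database whose rules all have non-empty heads, encodes each rule as a pair of frozensets `(antecedent, consequent)`, and uses the hypothetical name `derivable`; the refinement of Theorem 4.2(ii) for semi-definite atoms is omitted.

```python
def derivable(rules, act):
    """Test whether the disjunction of the atoms in `act` follows from `rules`.

    `act` plays the role of ACT(N): the atoms on the current branch.
    """
    act = frozenset(act)
    for antec, conseq in rules:
        # Only rules whose head lies inside ACT(N) and whose body is
        # disjoint from it may extend the branch (no repeated atoms).
        if conseq <= act and not (antec & act):
            # Each antecedent atom K opens a child branch with ACT + {K};
            # a rule with an empty body closes the branch immediately.
            if all(derivable(rules, act | {k}) for k in antec):
                return True
    return False


rules = [
    (frozenset(), frozenset({"a", "b"})),   # a v b
    (frozenset({"a"}), frozenset({"c"})),   # a -> c
    (frozenset({"b"}), frozenset({"c"})),   # b -> c
]
print(derivable(rules, {"c"}))  # c follows by case analysis on a v b
print(derivable(rules, {"a"}))  # a alone does not follow
```

Every recursive call adds an atom not already on the branch, so the recursion depth is bounded by the number of atoms, which mirrors the termination and linear-space remarks above.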

The  first  order  method  of  constructing  deduction  trees  employed  informally 
in  Example  6.1  is  taken  from  [7,8].  The  method  is  top-down,  left-to-right  and 
depth-first,  and  yields  a  correct  and  complete  method  of  testing  derivability  in 
first  order  positive  deductive  databases  [7,8].  The  obvious  difference  at  the  first 
order  level  is  the  use  of  unifying  substitutions,  which  have  the  effect  of  enlarging 
the  search  space.  In  addition,  termination  is  far  more  difficult  to  guarantee,  and 
negative  subgoals  may  need  special  attention,  again  in  order  to  limit  the  size  of 
the search space. These issues are discussed in Sections 7.1-7.3 below.

7.1  Termination 

During the construction of deduction trees we can (by Theorem 4.2) employ
ancestor pruning at will. Consequently, it is the introduction of new variables into
the construction which threatens termination. In [7,8] we showed that
termination can be guaranteed if we apply ancestor pruning, adopt a linear expansion
of semi-definite atoms (as per Theorem 4.2), and assume the existence of a level
function $\ell : \{P_1, P_2, \ldots, P_n\} \rightarrow \{0, 1, 2, \ldots, n+1\}$ such that

(a) $\mathrm{EXT}(\mathcal{L}) = \{P \in \mathcal{L} : \ell(P) = 0\}$ and $\mathrm{SD}(\mathcal{L}) = \{P \in \mathcal{L} : 1 \leq \ell(P) \leq n\}$. (We
may therefore define $\ell(C) = \ell(P)$ for any $P$ occurring in $\mathrm{conseq}(C)$.)

(b) If $C \in T$ and $x \in \mathrm{VAR}(\mathrm{antec}(C)) - \mathrm{VAR}(\mathrm{conseq}(C))$, then we may find an
$R(t) \in \mathrm{antec}(C)$ such that $\ell(R) < \ell(C)$ and $x$ appears in $t$.

Note  that  the  use  of  ancestor  pruning  within  our  termination  argument  is 
dependent  upon  the  use  of  a  function  free  language. 
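
Condition (b) is a purely syntactic test on each rule, as the following sketch suggests. It rests on several assumptions that are not stated in the text: antec(C) is taken to contain only the positive antecedent atoms, atoms are `(predicate, arguments)` tuples, the level function is a dictionary, and the names `VARIABLES`, `variables` and `satisfies_b` are hypothetical.

```python
VARIABLES = {"w", "x", "y", "z"}


def variables(atoms):
    """Collect the variables occurring in a list of atoms."""
    return {arg for _, args in atoms for arg in args if arg in VARIABLES}


def satisfies_b(antec, conseq, level):
    """Check condition (b) of Section 7.1 for a single rule."""
    # All head predicates share one level, so the first head atom suffices.
    head_level = level[conseq[0][0]]
    for x in variables(antec) - variables(conseq):
        # x must occur in a positive antecedent atom of strictly lower level.
        if not any(level[pred] < head_level and x in args
                   for pred, args in antec):
            return False
    return True


# Rule 2 of Example 6.1 (positive antecedents only): F(w,x), F(x,y) -> S(x,y)
level = {"F": 0, "S": 1}
print(satisfies_b([("F", ("w", "x")), ("F", ("x", "y"))],
                  [("S", ("x", "y"))], level))
# A rule violating (b): w occurs only in an atom at the head's own level.
print(satisfies_b([("S", ("w", "x"))], [("S", ("x", "x"))], level))
```

The level assignment used for F and S is an illustrative choice consistent with condition (a), not one given in the paper.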

7.2  Groundedness  and  yes/no  answers 

We saw in Example 6.1 that the subgoal $S(a,z)$ becomes grounded as a result
of the expansion of the tree below this subgoal. In [7,8] it is shown that in effect
this always occurs, thus allowing us to perform answer extraction in the case
when the top-level goal is not grounded.

In addition, if we assume the conditions of Section 7.1, and attack
semi-definite subgoals first, then we can ensure that if $P(t) \in \mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$
occurs on the current branch, then each variable in $t$ occurs in the root node
(whence again, ancestor pruning limits the number of such atoms). This has the
important consequence of limiting the search space when applying the disjunctive
rules in $\mathrm{INT}(T) - \mathrm{SD}(T)$. (As an aside, note that the search space is already
limited when applying rules in $\mathrm{SD}(T)$ (by linearity and definiteness) and when
applying rules in $\mathrm{EXT}(T)$ (by groundedness).)

Note in particular that if the root node is ground (as it is in a yes/no query
and in the subtree below a UQC tree), then all positive atoms in $\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$
lying on the current branch are ground. In this case, the unifying substitution
when applying a rule $C \in \mathrm{INT}(T) - \mathrm{SD}(T)$ is simply a ground instantiation of
$\mathrm{conseq}(C)$.


7.3  Negative  subgoals  and  subgoal  re-ordering 

Generating UQC trees for instances of $Q(x)$ will (by Theorem 5.3) tell us which
instances of $\neg Q(x)$ are contained in the relevant $D^-_\beta$. It will not however tell us
which of these instances to pick, hence the choice is somewhat arbitrary. This is
not desirable, since in resolution based reasoning the unifications are expected
to perform the necessary instantiations. Two solutions which guarantee that $x$
becomes grounded before we attack $\neg Q(x)$ are as follows.

Since $\mathrm{VAR}(\mathcal{N}(C)) \subseteq \mathrm{VAR}(\mathrm{antec}(C))$ we could re-order the subgoal $\neg Q(x)$
so that it appears to the right of (i.e., after) other positive siblings containing the
variables in $x$. Since such siblings become grounded as a result of their expansion,
this would have the desired effect. On the other hand the re-ordering itself is
somewhat undesirable since (as explained in Example 6.1) we would prefer to
attack negative subgoals first.
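
The first solution amounts to a simple re-ordering of a rule body. The sketch below is one possible reading, not the paper's procedure: it moves each negative literal to the earliest position at which all of its variables are covered by preceding positive literals, so that ground negative subgoals are still attacked first. Literals are encoded as `(negated, predicate, arguments)` tuples, and the names `VARIABLES`, `lit_vars` and `reorder` are hypothetical.

```python
VARIABLES = {"w", "x", "y", "z"}


def lit_vars(literal):
    """Variables occurring in a (negated, predicate, arguments) literal."""
    return {a for a in literal[2] if a in VARIABLES}


def reorder(literals):
    """Delay each negative literal until positive ones cover its variables."""
    pending = [lit for lit in literals if lit[0]]   # negative literals
    ordered, seen = [], set()

    def flush():
        # Emit every pending negative literal whose variables are covered.
        for lit in list(pending):
            if lit_vars(lit) <= seen:
                ordered.append(lit)
                pending.remove(lit)

    flush()  # ground negative literals can go first
    for lit in literals:
        if lit[0]:
            continue
        ordered.append(lit)
        seen |= lit_vars(lit)
        flush()
    return ordered + pending  # any uncovered negatives go last


# Body of rule 2 in Example 6.1: not F(w,w), F(w,x), F(x,y)
body = [(True, "F", ("w", "w")), (False, "F", ("w", "x")), (False, "F", ("x", "y"))]
print(reorder(body))
```

For the body of rule 2 this places $\neg F(w,w)$ directly after $F(w,x)$, the first positive sibling that contains $w$.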

An  alternative  solution  is  to  make  a  further  assumption: 

(*) If $C \in \mathrm{INT}(T) - \mathrm{SD}(T)$ and $Q(x) \in \mathcal{N}(C)$ with $Q \in \mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$, then
$\mathrm{VAR}(x) \subseteq \mathrm{VAR}(\mathrm{conseq}(C))$.

As in Section 7.2, in the case when the root is grounded, condition (*) has the
effect of ensuring that a negative subgoal $\neg Q(x)$ (with $Q \in \mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$)
will be grounded as soon as it enters the tree. This still leaves the problem of
ungrounded negative subgoals $\neg Q(x)$ when $Q \in \mathrm{EXT}(\mathcal{L}) \cup \mathrm{SD}(\mathcal{L})$, but this is
no loss since condition (b) of Section 7.1 still requires us to handle such subgoals
below ungrounded positive semi-definite subgoals (as with $\neg F(w,w)$ in Example
6.1). Moreover the relevant UQC trees are far easier to compute (i.e., in a purely
linear fashion). In addition, they also allow a simpler testing of $\neg Q(x) \in D^-_{\alpha+1}$,
since if $Q(a) \in \mathrm{EXT}(\mathcal{L}) \cup \mathrm{SD}(\mathcal{L})$ and $\mathcal{T}$ is a UQC tree for $Q(a)$, then $\mathcal{N}(\mathcal{T}) \subseteq
\mathrm{EXT}(\mathcal{L}) \cup \mathrm{SD}(\mathcal{L})$, and $\mathcal{O}(\mathcal{T}) \subseteq \mathrm{EXT}(\mathcal{L})$ (whence $\mathrm{gr}(T)\backslash_g(\mathcal{H} - \mathcal{N}(\mathcal{T}) \cup D^-_\alpha) \models
\bigvee \mathcal{O}(\mathcal{T})$ iff $\mathcal{O}(\mathcal{T})$ contains some disjunct from $\mathrm{EXT}(T)$).

8  Computing  UQC  trees 

Let $P(t) \in \mathrm{EXT}(\mathcal{L}) \cup \mathrm{SD}(\mathcal{L})$. In [6] we showed that cyclic (and hence UQC) trees
for instances of $P(t)$ can be computed via a top-down, left-to-right, depth-first
and linear construction. In order to guarantee termination, ancestor pruning is
employed (as dictated by condition (d) of Definition 5.1) and we assumed the
existence of a level function $\ell$ as in Section 7.1 such that

(†) If $C \in \mathrm{SD}(T)$ and $x \in \mathrm{VAR}(\mathrm{antec}(C)) - \mathrm{VAR}(\mathrm{conseq}(C))$, then we may
find an $R(t) \in \mathrm{antec}(C)$ such that $\ell(R) < \ell(C)$ and $x \in t$.

Note that (†) is subsumed by condition (b) of Section 7.1, hence it is no
further imposition. In [6,7] we presented a construction of UQC trees for atoms
$P(t) \in \mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$ under conditions (†) and (‡):

(‡) If $C \in \mathrm{INT}(T) - \mathrm{SD}(T)$ and $x \in \mathrm{VAR}(C)$, then we may find an $R(t) \in
\mathrm{antec}(C)$ such that $\ell(R) < \ell(C)$ and $x \in t$.

However (‡) is a far more stringent condition than (†). We can eliminate the
need for (‡) by partially compiling the construction of UQC trees for atoms in
$\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$ using semi-factored quasi cyclic trees (see below). Such
compilation is motivated and justified by the fact that we generally expect $\mathrm{INT}(T)$
to be relatively static (in comparison to $\mathrm{EXT}(T)$). Pre-processing of cyclic trees
has also been employed in [9] to facilitate query compilation in propositional
stratified databases.

8.1  Semi- factored  quasi  cyclic  trees 


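Definition 8.1.1. Let $P(t) \in \mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$. A semi-factored, quasi cyclic
(SQC) tree for $P(t)$ is a quasi cyclic tree $\mathcal{T}$ in which for each
predicate node $N$, $N$ is a leaf node iff $\mathrm{lab}(N) \in \mathrm{EXT}(\mathcal{L}) \cup \mathrm{SD}(\mathcal{L})$.

Let $\mathrm{leaf}(\mathcal{T}) = \{N : N$ is a predicate leaf node in $\mathcal{T}\}$.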

UQC  trees  are  constructed  by  extending  SQC  trees  in  the  obvious  way. 

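Theorem 8.1.2. Let $P(a) \in \mathcal{H} \cap (\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L}))$. Then $\mathcal{T}$ is a UQC tree for
$P(a)$ in $T$ iff we may find an SQC tree $\mathcal{T}'$ for $P(a)$ in $\mathrm{gr}(T)$, and for each
$N \in \mathrm{leaf}(\mathcal{T}')$ we can find a UQC tree $\mathcal{S}_N$ for $\mathrm{lab}(N)$ in $T$ such that

(a) $\mathcal{T}'$ is an initial segment of $\mathcal{T}$,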


386  C.A. Johnson 


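(b) for each $N \in \mathrm{leaf}(\mathcal{T}')$, $\mathcal{S}_N$ equals the subtree of $\mathcal{T}$ below $N$, and

(c) $\bigcup\{\mathrm{Pred}(\mathcal{S}_N) : N \in \mathrm{leaf}(\mathcal{T}')\} \cap \bigcup\{\mathcal{O}(\mathcal{S}_N) : N \in \mathrm{leaf}(\mathcal{T}')\} = \emptyset$.

Moreover if conditions (a), (b) and (c) are satisfied, then $\mathrm{Pred}(\mathcal{T})$, $\mathcal{O}(\mathcal{T})$ and
$\mathcal{N}(\mathcal{T})$ are the union of the corresponding sets from $\mathcal{T}'$ and $\{\mathcal{S}_N : N \in \mathrm{leaf}(\mathcal{T}')\}$.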


8.2  Partial  compilation  of  UQC  trees 

We will partially compile the computation of UQC trees for atoms in
$\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$. The compilation phase will consist of computing a set of SQC
trees, and the run-time computation of UQC trees will involve extending these
SQC trees to UQC trees.


8.2.1 Computing SQC trees. As the construction of a first order quasi cyclic tree
proceeds, existing branches are extended by the application of rules and unifying
substitutions. The main problem is to ensure that these unifying substitutions
do not violate the conditions of Definition 5.1. This is complicated still further
by the fact that the unifying substitutions can cause the CYC sets to be altered.
In [6,7] we showed that when computing UQC trees, these problems can be
overcome by the use of conditions (†) and (‡), since these cause sufficiently large
parts of the tree to become grounded as the construction proceeds.

In this section we wish to compute SQC trees without resorting to (‡). ((†)
is of no help, since SQC trees employ only rules in $\mathrm{INT}(T) - \mathrm{SD}(T)$.) Thus in
order to overcome the above-mentioned problems, we will, as the construction
proceeds, develop a set of constraints $\mathcal{I}$ which will guarantee the conditions of
Definition 5.1, and also ensure that the CYC sets do not change. Termination
of our construction is guaranteed by the fact that the length of any branch
through an SQC tree in $\mathrm{gr}(T)$ is bounded by $1 + |\mathcal{H}'| * (|\mathcal{H}'| + 1)/2$, where
$\mathcal{H}' = \mathcal{H} \cap (\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L}))$ [6]. Of course $\mathcal{H}'$ is likely to be large, but this
compilation only needs to be performed once (unless $\mathrm{INT}(T) - \mathrm{SD}(T)$ changes).

Suppose that we have constructed a partial SQC tree $\mathcal{T}$ with existing
constraints $\mathcal{I}$. A valid extension of $(\mathcal{T}, \mathcal{I})$ is constructed as follows. Pick a branch
through $\mathcal{T}$, $(\mathrm{root}(\mathcal{T}) = P_1(x_1), P_2(x_2), \ldots, N = P_r(x_r))$, where $P_r \in \mathrm{INT}(\mathcal{L}) -
\mathrm{SD}(\mathcal{L})$ and $r < |\mathcal{H}'| * (|\mathcal{H}'| + 1)/2$. $\mathcal{I}$ will contain $\{x_i \neq x_j : i < j < r$, $P_i(x_i)$
and $P_j(x_j)$ are unifiable, but not identical$\}$. First we need to fix $\mathrm{CYC}(N)$.

If $P_r(x_r) \in \{P_i(x_i) : i < r\}$, then $\mathcal{I}$ already ensures that $\mathrm{CYC}(N)$ is fixed,
thus suppose that $P_r(x_r) \notin \{P_i(x_i) : i < r\}$. If $P_r(x_r)$ is not unifiable with any
$P_i(x_i)$ ($i < r$), then $\mathrm{CYC}(N) = \{\mathrm{lab}(N)\}$ is fixed. Finally if $P_r(x_r)$ is unifiable
with some $P_i(x_i)$ ($i < r$) then we either (i) pick some such $i$ such that the most
general unifier $\mu = \mathrm{mgu}\{x_r, x_i\}$ does not violate $\mathcal{I}$, and apply $\mu$ to $\mathcal{T}$ and $\mathcal{I}$, or
(ii) add $\{x_r \neq x_i : i < r$, $P_r(x_r)$ and $P_i(x_i)$ are unifiable$\}$ to $\mathcal{I}$.

We can then extend our branch with some rule $C \in T$. We need to pick a
most general unifier $\eta$ for a subset of $\mathrm{conseq}(C)$ and a subset of $\mathrm{CYC}(N)$ such
that $\eta$ does not violate $\mathcal{I}$, $\mathrm{antec}(C\eta) \cap \mathrm{CYC}(N)\eta = \emptyset$ and $\mathrm{conseq}(C\eta) - \mathrm{CYC}(N)\eta$
is disjoint from $\mathrm{Pred}(\mathcal{T}\eta) \cup \mathrm{antec}(C\eta)$ (cf., condition (c) of Definition 5.1). $\eta$ is
applied to $\mathcal{T}$ and $\mathcal{I}$, and $\mathcal{T}\eta$ is then extended with $RN_{C\eta}$ (and the corresponding
child nodes of $RN_{C\eta}$). Let $\mathcal{T}'$ be the extended tree.

For each $Q(x) \in \mathcal{O}(RN_{C\eta})$, if $Q(x)$ is unifiable with some $Q(x') \in \mathrm{Pred}(\mathcal{T}')$,
then we add $x \neq x'$ to $\mathcal{I}$. Similarly if $Q(x) \in \mathrm{antec}(C\eta)$ is unifiable with some
$Q(x') \in \mathrm{CYC}(N)$, then we add $x \neq x'$ to $\mathcal{I}$.

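A pair $(\mathcal{T}_m, \mathcal{I}_m)$ is complete iff $\mathcal{T}_m$ is an SQC tree and there is a
sequence $(\mathcal{T}_0, \mathcal{I}_0), (\mathcal{T}_1, \mathcal{I}_1), \ldots, (\mathcal{T}_m, \mathcal{I}_m)$ such that $\mathcal{T}_0$ consists of the single root
$P(x_1, x_2, \ldots, x_k)$, $\mathcal{I}_0 = \emptyset$, and each $(\mathcal{T}_{i+1}, \mathcal{I}_{i+1})$ is a valid extension of $(\mathcal{T}_i, \mathcal{I}_i)$.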


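Theorem 8.2.2. Let $P(a) \in \mathcal{H} \cap (\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L}))$. Then $\mathcal{T}$ is an SQC tree
for $P(a)$ in $\mathrm{gr}(T)$ iff there is a complete pair $(\mathcal{T}_m, \mathcal{I}_m)$ and a substitution $\theta$ such
that $\theta$ does not violate $\mathcal{I}_m$, $\mathcal{T} = \mathcal{T}_m\theta$ and $\mathrm{lab}(\mathrm{root}(\mathcal{T})) = P(a)$.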

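After constructing a complete pair $(\mathcal{T}, \mathcal{I})$, we can then discard the tree $\mathcal{T}$
itself: we simply need to keep $(\mathrm{root}(\mathcal{T}), \mathrm{leaf}(\mathcal{T}), \mathcal{N}(\mathcal{T}), \mathcal{O}(\mathcal{T}), \mathcal{I})$. As indicated
at the end of Section 5, we can also eliminate redundant SQC trees. Specifically
if we can find a substitution $\theta$ such that $\mathrm{root}(\mathcal{T})\theta = \mathrm{root}(\mathcal{T}')$, $\mathrm{leaf}(\mathcal{T})\theta \subseteq
\mathrm{leaf}(\mathcal{T}')$, $\mathcal{N}(\mathcal{T})\theta \subseteq \mathcal{N}(\mathcal{T}')$, $\mathcal{O}(\mathcal{T})\theta \subseteq \mathcal{O}(\mathcal{T}')$ and $\mathcal{I}\theta \subseteq \mathcal{I}'$, then we can discard
$(\mathrm{root}(\mathcal{T}'), \mathrm{leaf}(\mathcal{T}'), \mathcal{N}(\mathcal{T}'), \mathcal{O}(\mathcal{T}'), \mathcal{I}')$.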

For example suppose that $\mathrm{INT}(T) - \mathrm{SD}(T) = \{S(x) \rightarrow Q(x) \vee R(x),\ S'(x,y) \wedge
Q(y) \rightarrow Q(x) \vee P(x)\}$, with $\mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L}) = \{P, Q, R\}$. SQC trees for $Q(x)$
can be developed by $m$ applications of the second rule followed by a single
application of the first. For $m > 0$ the tree developed is subsumed, in the above
sense, by the tree developed for $m = 0$.


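8.2.3 Computing UQC trees from SQC trees. By Theorems 8.1.2 and 8.2.2, if
$P(a) \in \mathrm{INT}(\mathcal{L}) - \mathrm{SD}(\mathcal{L})$ then $\mathcal{T}$ is a UQC tree for $P(a)$ iff we can find a complete
pair $(\mathcal{T}_m, \mathcal{I}_m)$ and for each $N \in \mathrm{leaf}(\mathcal{T}_m)$ a UQC tree $\mathcal{S}_N$ for some instance
$\mathrm{lab}(N)\theta$ of $\mathrm{lab}(N)$ such that $\theta$ does not violate $\mathcal{I}_m$, $\mathrm{lab}(\mathrm{root}(\mathcal{T}_m))\theta = P(a)$ and
$\mathcal{T}_m\theta$ and $\{\mathcal{S}_N : N \in \mathrm{leaf}(\mathcal{T}_m)\}$ satisfy the conditions of Theorem 8.1.2.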


9  Conclusions  and  further  research 


We have presented a top-down correct and complete query processing method
for first order deductive databases under the DWFS. We have also investigated
termination and efficiency aspects of our method by examining partial
compilation, subgoal re-ordering and restrictions on database rule format. Our method
is based upon a branch by branch tree traversal, and therefore (in common with
the methods of [1] for propositional databases) operates in space that is
polynomial in the size of the underlying language. The techniques presented in this
paper have also been applied, suitably modified, to the perfect and the disjunctive
stable model semantics [6-9,11].

The  following  open  questions  are  worthy  of  further  investigation. 

(i) Can our techniques handle query compilation in DWFS, in which we
preprocess a query using $\mathrm{INT}(T)$ so that its run-time processing then involves
$\mathrm{EXT}(T)$ only?

(ii) To what extent can we employ search space pruning techniques in order to
make the tree constructions more efficient? In particular, can we use
information from a UQC tree construction (say) to prune the resulting deduction
tree constructions?

(iii)  Can  our  methods  be  extended  to  languages  containing  function  symbols? 

(iv)  Could  we  guarantee  termination  with  weaker  constraints  and/or  promote 
efficiency  either  by  combining  our  methods  with  the  bottom-up  methods  of 
[1],  or  by  making  further  “natural”  assumptions  about  deductive  databases 
(eg.,  with  respect  to  the  amount  of  disjunctive  or  recursive  information)? 

Abstract. We propose a system whereby subtle semantic ambiguity
found in queries of distributed heterogeneous database systems can be
resolved by considering the user’s intentions. Through the use of
domain-specific knowledge embedded within a mediator-based architecture,
subtleties in meaning can be explicitly modeled. Through the use of dynamic
profiles and active dialogue, the system can discover user intent,
providing more satisfying query answers.


1  Introduction  and  Problem  Statement 

Modern  heterogeneous  database  systems  generally  require  users  to  issue  queries 
via  a  global  query  language.  This  is  because,  typically,  no  one  member  database 
of  the  distributed  system  has  all  the  concepts  of  the  whole  system.  Moreover, 
two  identical  terms  in  two  different  local  schemas  may  have  slight  semantic 
differences  in  the  context  of  the  entire  network.  To  avoid  the  ambiguity  caused  by 
this  heterogeneity,  many  systems  have  a  global  language  containing  a  vocabulary 
of  terms  with  exact  pre-specified  meanings.  Wrappers  are  then  used  to  translate 
global  terms  into  their  meanings  within  a  local  schema’s  context. 

This global approach places excess burden on the users to first understand
the  semantic  differences  that  terms  may  have  in  different  databases  and  then  to 
very  carefully  express  queries  to  precisely  reflect  their  desired  semantics.  First, 
note  that  global  concepts  are  often  expressed  as  generalizations  that  are  broad 
enough  to  cover  the  varying  meanings  of  that  concept  at  the  local  levels.  The  user 
has  to  be  aware  of  the  different  nuances  that  terms  might  have  in  the  other  local 
repositories  and  add  appropriate  constraints  to  be  sure  his/her  intent  is  specified. 
Second,  as  new  databases  join  the  global  network,  existing  terms  may  take  on 
even  more  nuances,  new  terms  and  concepts  may  enter  the  global  language, 
and  perhaps  even  the  global  schema  itself  may  change.  Users  will  then  need  to 
become  familiar  with  the  changes,  and  so  user  training  in  a  dynamic  network  is 
not a one-time affair, as many researchers claim. Similarly, applications that had
been  written  for  the  local  database  and  been  modified  to  adapt  to  the  network 
at  a  particular  point  in  time  may  also  need  additional  updating  as  the  network 


Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI  1932,  pp.  389-399,  2000. 
©  Springer-Verlag  Berlin  Heidelberg  2000 


390  C.  Fernandes  and  L.  Henschen 


changes.  For  the  same  reason,  commonly  run  queries  that  were  written  and  saved 
under  an  older  incarnation  of  the  global  schema  may  also  need  to  be  updated. 

The  opposite  approach  allows  users  to  specify  queries  in  their  own  local 
database  language  and  provides  translators  to  map  those  queries  to  a  global 
form  that  reflects  the  semantics  of  that  local  database.  This  allows  users  to  issue 
distributed  queries  and  to  correctly  interpret  answers  without  having  to  learn  a 
new  language  and  be  continually  retrained.  This  would  be  a  distinct  advantage 
for  non-sophisticated  users,  and  it  is  reasonable  to  expect  a  large  number  of 
database  users  to  fit  into  this  category.  For  example,  a  distributed  information 
system  might  include  car  rental  companies  and  meteorological  databases,  both  of 
which  use  the  term  “map”  with  related  but  distinct  meanings.  A  counter  agent 
at  a  car  rental  location  may  never  care  about  seeing  meteorological  maps  of  the 
region,  only  street  maps  to  give  to  customers  renting  cars.  Such  counter  agents 
are  probably  not  database  experts  and  should  not  be  required  to  know  about 
all  the  other  kinds  of  maps  that  may  be  available  and  how  to  specify  just  street 
maps  when  issuing  a  query.  On  the  other  hand,  the  local-query  approach  forces 
any  user  who  wants  to  take  advantage  of  the  variety  of  related  information  to 
again  learn  the  global  language  and  continuously  keep  up  to  date  as  the  network 
evolves.  A  travel  planner  in  the  “maps”  example  may  very  well  want  one  or 
both  kinds  of  maps  when  planning  a  group  tour  to  account  for  both  the  actual 
transportation  as  well  as  possible  weather-related  contingencies.  Even  more,  the 
planner may issue the same query, for example “get a map of the Boston area”,
at  different  times  and  want  different  kinds  of  maps  based  on  what  aspect  of  the 
tour  is  being  planned  at  that  moment.  But,  again,  travel  planners  are  not  likely 
to  be  database  experts,  and  it  would  be  advantageous  to  develop  some  kind  of 
system  that  would  help  such  a  user  cope  with  a  semantically  diverse  and  evolving 
distributed  knowledge  base. 

We  will  describe  in  this  paper  a  system  that  attempts  to  merge  the  best 
features  of  each  approach  and  to  help  users  of  both  kinds.  Queries  are  expressed 
in  local  database  languages,  and  the  query-processing  algorithm  uses  the  query 
and  a  global  knowledge  base  in  combination  with  information  from  an  individual 
user  profile  and  even  user  dialogue  to  translate  the  local  query  into  a  global  one. 
Through  the  use  of  the  profile  and  dialogue,  our  system  tries  to  discover  the  real 
intent  of  the  user  query.  In  addition,  our  system  attempts  to  help  the  user  specify 
the  intent  when  it  can’t  be  guessed  or  when  the  user  has  requested  help.  The 
development  and  use  of  individual  user  profiles  and  the  nature  of  the  dialogue 
are  the  primary  foci  of  this  paper.  They  are  described  in  Section  3.  In  Section  2, 
we  differentiate  our  work  from  other  research  in  this  area,  but  space  limitations 
preclude  any  extensive  discussion  of  prior  and  related  work.  Section  5  presents 
concluding  remarks  and  the  direction  we  would  like  to  take  in  the  future. 


2  Related  Work 

We  have  chosen  a  mediator-based  system  for  our  implementation  since  it  allows 
for  the  two  most  important  criteria  we  desire:  the  ability  to  ask  queries  through
local  member  databases  and  the  presence  of  a  global  knowledge  base  to  perform
translation  on  a  query-by-query  basis.  Since  intent  can  change  over  time,  having 
this  latter  property  is  critical.  Many  other  prototypes  are  also  mediated  systems, 
but  most  do  not  consider  the  idea  of  intent  as  we  have  stated  it. 

Most mainstream prototypes use a single global language to represent queries.
These include rule-based languages such as the one used by the HERMES
system [7], description logic-based languages such as Loom used by SIMS [1],
clause-based languages such as that used by Information Manifold [4], and others
such as OQL used by DISCO [8]. Using these global languages provides certain
advantages. HERMES, for example, is able to incorporate a degree of probability
into its rules, while Information Manifold’s use of logical clauses provides a
robust translation from its global language into local subqueries. However, they
all prevent the user from using a local language with which s/he may already be
familiar, and they all prevent the use of applications which already utilize that
local language.

A  means  of  adjusting  to  different  user  intents  is  another  aspect  found  in 
only  a  few  prototype  systems.  One  such  system,  TSIMMIS  [2],  is  web-based 
and  allows  for  a  limited  type  of  intent  clarification  by  including  hypertext  in  a 
query  result.  By  this  hypertext  the  user  can  see  more  general  or  more  specific 
information  concerning  the  objects  involved  in  the  query  answer.  This  could  be 
extended  so  that  the  user  could  reissue  a  query  after  specifying  a  more  appropriate
level  of  object  detail  or  after  delineating  ambiguous  concepts.  We  propose
a  dialogue  system  which  would  clearly  define  user  intent  on  a  level  not  seen  in 
these  prototypes. 

Another important aspect which we include in our system, but which others
do not, is conditional attribute equivalence. Essentially, this term means that any
two attributes which are considered to be equivalent or at least synonymous for
one query may not be considered equivalent or synonymous for another. Consider
a distributed system dealing with universities, where one local database keeps
track of an undergraduate-only institution while the other holds data of a
university with graduates and undergraduates. Two “student” attributes in these
two local databases may be considered equivalent for a query dealing with
counting up all students but quite dissimilar for a query dealing solely with graduate
students. Existing systems such as OBSERVER [5] may use a single ontological
term to relate a priori to multiple local attributes. As a result, there is no easy
way for these equivalences to change from query to query. Our approach will use
constructs called annotations [3,6] to accomplish conditional attribute
equivalence. These annotations will provide for query-dependent processing, which we
believe is essential for determining user intent.


3  Discovering  a  User’s  Intentions 

Our system discovers different possible query interpretations at run-time. Due
to space limitations we will not explain our query processing system in detail;
the interested reader is referred to [3,6]. When necessary, we will explain those
features of the query processor that intersect with our techniques for determining
the user’s true intentions.

    Person
    Name    Phone      State
    Fred    111-1111   VA
    Sally   222-2222   IL
    Tom     333-3333   HI

    SELECT name, phone
    FROM Person
    WHERE state="IL"

Fig. 1. Sample query in an Employee Database


3.1  A  Motivating  Example 

We first illustrate that subtle heterogeneity can creep into even the most
innocuous of queries. Figure 1 shows a table from an employee database showing each
worker’s name, phone number, and place of residence. The accompanying query
is asking for the names and phone numbers of those employees residing in IL.
While the semantics for such a query would be clear for any one relational
database, it becomes more complex if this is a distributed query over many employee
tables. Figure 2 shows two local databases containing identically structured
employee tables, but differing in their interpretation of the phone attribute. In
DB1, the phones are all cellular phones. They do not have a permanent
location, and they are probably kept with the employee most of the time. However,
in DB2, the phones are regular permanent phones, located most likely at the
employee’s place of residence. This difference is not reflected in either of the two
tables, and it produces two different query interpretations. Did the user specify
“IL” because s/he intended to call there, perhaps looking for the employee’s
family? Or did the user intend to contact people based in IL, regardless of where
their phones are? One’s initial solution in a small example like this might be
to return both sets of phone numbers so that the user could decide. But this
solution is not feasible in general, especially where scalability and time are
factors. In an emergency situation such as a plane crash, passenger manifests need
to be used to notify family members quickly. In this scenario, permanent phone
numbers would be desired, not cell phone numbers. Having the system return
hundreds of extraneous tuples that the user must sift through is not practical.

3.2  Finding  Ambiguities  in  Intent 

In  order  to  choose  the  desired  interpretations  of  a  given  query,  a  system  must 
be  able  to  explicitly  model  subtle  semantic  differences  like  the  one  given  here.

DB1: cellular phones

    Person
    Name    Phone      State
    Fred    111-1111   VA
    Sally   222-2222   IL
    Tom     333-3333   HI

DB2: regular phones

    Empl
    Name    Regphone   Place
    Sally   444-4444   IL
    Carl    555-5555   IL
    Beth    666-6666   VA

Fig. 2. Subtle Heterogeneity in Logically Identical Tables

[Figure 3 (graph): a Person vertex and edges carrying the following annotations]

    {DB1.Person.name, DB1.Person.phone}   {DB2.empl.name, DB2.empl.regphone}
    {DB1.Person.name, DB1.Person.state}   {DB2.empl.name, DB2.empl.place}
    {DB2.empl.regphone, DB2.empl.place}

Fig. 3. Section of Mediator used for Phone Example


The  ambiguity  in  the  phone  example  stems  from  the  fact  that  there  are  distinct 
relationships  between  the  phone  and  state/place  attributes  in  DB1  and  DB2. 
We  have  developed  a  graph-based  mediator  [3,6]  that  precisely  and  explicitly 
maps  out  concepts  and  relationships  at  the  global  level  and  relates  these  to  the 
concepts  and  relationships  in  the  individual  local  databases.  Although  the  global 
representation  and  the  related  algorithms  are  not  the  focus  of  this  paper  (see 
[3,6]  for  full  discussions  of  these),  we  illustrate  briefly  how  our  system  models 
semantic  variations. 

[Figure 4 (screenshot): "New User Profile Generator" window asking: "With
respect to telephones the system knows about 2 different types: cellular (mobile)
phones and regular (non-mobile) phones. Which type(s) should be considered?"
Visible options include "cell phones only" and "regular phones only".]

Fig. 4. Creating a User Profile

Figure 3 shows a portion of the mediator for our phone example. Vertices
represent concepts, while edges represent relationships. Knowledge is embedded
in the mediator in the form of annotations. An annotation is an ordered pair
{x, y} where x and y are local database attributes. If an annotation {x, y} is
associated with a relationship R, then R must hold between local attributes x
and y. Thus, the annotation {DB2.empl.regphone, DB2.empl.place} shown in
Figure 3 means that the relationship rings must hold between the Phone and
Location concepts in local database DB2. That is, for DB2, the phone must
ring in the specified place. The lack of a similar annotation for DB1 for the
rings relationship means that rings does not hold or is unknown in DB1. When
analyzing this section of the graph during query processing, our algorithm can
use the presence or absence of certain annotations to detect possible differences
in intent.
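
The presence-or-absence test on annotations can be illustrated with a small sketch. This is not the authors' mediator (see [3,6] for that); the class and function names are hypothetical, and the only data taken from the paper is the rings annotation for DB2. A relationship is flagged when it is annotated for some, but not all, of the databases a query touches.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    """An edge of a (hypothetical) mediator graph between two global concepts."""
    name: str
    source: str
    target: str
    # Each annotation is an ordered pair (x, y) of local attributes,
    # written as "DB.table.attribute" strings.
    annotations: list = field(default_factory=list)


def annotated_databases(rel):
    """Return the local databases for which the relationship is asserted."""
    return {x.split(".")[0] for x, _ in rel.annotations}


def find_ambiguities(relationships, query_databases):
    """List relationships that hold in some, but not all, queried databases."""
    ambiguous = []
    for rel in relationships:
        holds = annotated_databases(rel) & query_databases
        if holds and holds != query_databases:
            missing = query_databases - holds
            ambiguous.append((rel.name, sorted(holds), sorted(missing)))
    return ambiguous


# The rings relationship from the phone example: annotated for DB2 only.
rings = Relationship("rings", "Phone", "Location",
                     [("DB2.empl.regphone", "DB2.empl.place")])

print(find_ambiguities([rings], {"DB1", "DB2"}))
# [('rings', ['DB2'], ['DB1'])]
```

A relationship annotated for every queried database produces no entry, so only genuine differences between members would be passed on to the resolution step.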


3.3  Resolving  Ambiguities  in  Intent 

Once  the  source  of  competing  interpretations  has  been  found,  the  system  can 
then  move  to  the  task  of  resolving  the  ambiguity.  We  have  developed  two  methods
for  determining  what  the  user  truly  intended:  individual  user  profiles  and
on-the-fly  dialogue. 

A user profile is a list of preconceived notions that the user has about what
local query concepts mean. A profile may contain domain-related information,
e.g. the concept student means just undergraduates. It may also contain general
query processing preferences, e.g. to exclude a particular local database from
contributing answers or to allow separate databases to participate in providing
partial answers that can be combined into a complete whole (i.e. inter-database
joins).

A profile can be created and modified before query run-time. In the system
we are developing, a profile is initially created when a new user logs into the
system for the first time. At that point, the user is asked a series of questions
about how s/he interprets certain local concepts. A screenshot in our system for
this process is shown in Figure 4. The result is a personal knowledge base that
the query processor can consult to help determine a user’s intent. In our phone
example, the query processor could examine this profile when it considered the
rings relationship shown in Figure 3. If the user had indicated that only
non-mobile phones are of interest, then DB1, containing only cell phones, would be
eliminated from the search space.
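
As a rough illustration of this pruning step, the sketch below stores a profile as a dictionary and drops any database whose phone type the user has ruled out. The dictionary layout, the `PHONE_TYPE` table, and the function name are assumptions made for the example; the paper does not specify how profiles are stored.

```python
# Hypothetical metadata: which kind of phone each local database records.
PHONE_TYPE = {"DB1": "cellular", "DB2": "regular"}


def prune_by_profile(databases, phone_type_of, profile):
    """Keep only databases whose phone type is acceptable under the profile.

    profile["phone"] is assumed to be either a set of accepted phone types
    or the string "depends" (no fixed preference).
    """
    wanted = profile.get("phone", "depends")
    if wanted == "depends":
        return list(databases)  # nothing can be ruled out ahead of time
    return [db for db in databases if phone_type_of[db] in wanted]


profile = {"phone": {"regular"}}  # the user cares about non-mobile phones only
print(prune_by_profile(["DB1", "DB2"], PHONE_TYPE, profile))
# ['DB2']
```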

Profiles  need  not  exist  only  for  individual  users,  however.  Profiles  for  specific 
classes  of  individuals  could  also  be  created.  For  example,  all  incoming  business 
school  students  at  a  university  or  all  new  secretaries  in  a  corporation  could  have 
a  class  profile  that  would  ensure  a  uniform  query  vocabulary.  The  fact  that  a 
local  query  language  is  being  used  would  provide  a  foundational  interpretation 
of  terms  for  creating  such  a  class  profile.  Similarly,  a  profile  could  be  set  up  for 
an  application,  extending  its  access  from  the  original  database  to  a  distributed 
federation  of  databases.  Moreover,  as  the  global  schema  changes  in  response  to 
the  addition  or  deletion  of  member  databases,  administrators  could  modify  these 
class  profiles  to  keep  end  users  and  applications  up-to-date.  These  changes  would 
be  transparent  to  non-sophisticated  users  while  still  allowing  more  advanced 
users  to  tweak  their  own  profiles  if  they  so  desire. 

In  addition  to  a  profile  generator,  we  have  also  implemented  a  DBA  toolkit 
by  which  profile  generators  can  be  created  quickly.  This  toolkit  is  illustrated  in 
Figure  5.  We  envision  that  both  local  and  global  database  administrators  are 
in  the  best  position  for  understanding  the  subtle  semantic  differences  between 
the  syntactically  identical  concepts  across  individual  member  databases.  The 
administrators  can  then  utilize  this  tool  to  form  the  questions  which  will  make 
up  profile  generators.  These  generators  can  be  tailored  by  each  local  DBA  so  that 
questions  are  best  expressed  to  match  the  level  of  user  expertise  and  “lingo”  in 
use  at  that  local  site. 

A  query-independent  mechanism  like  a  user  profile  is  not  sufficient  by  itself, 
however.  A  user  may  be  aware  that  his/her  interpretation  of  a  term  may  change 
depending  on  the  query  asked.  The  profile  generator  accounts  for  this  by  allowing 
the  user  to  pick  a  “depends”  option  as  shown  in  Figure  4.  This  tells  the  system 
that  when  a  particular  concept  is  used  in  a  query,  its  meaning  will  need  to  be 
determined  at  run-time.  In  this  way,  a  user  would  not  need  to  continually  update 
his/her  profile  just  because  s/he  wishes  to  execute  a  different  query.  Instead  the 
“depends”  option  signals  the  query  processor  to  ask  the  user  (say,  via  a  set  of 
choices  in  a  dialogue  box)  for  the  user’s  interpretation  of  the  concept  for  this 
query.  This  type  of  dialogue  is  called  on-the-fly  dialogue. 

C. Fernandes and L. Henschen

[Figure 5 (screenshot): DBA toolkit form with a field labeled "Type question
text here:" containing the sample question "Which types of databases should be
used?"]

Fig. 5. DBA Toolkit for Creating Profile Generators

    Buyer --- purchases ---------- Product   {DB3.Product.Buyer_ID, DB3.Product.prod_id}
    Buyer --- in market to buy --- Product   {DB3.Want.Customer_ID, DB3.Want.Product_name}

Fig. 6. Subset of Mediator used in an Ecommerce Network

Consider the mediator subset in Figure 6 dealing with an ecommerce network.
Here a single member database, DB3, allows two possible interpretations
for queries dealing with buyers. The Buyer concept can either represent those
who have actually made purchases of a given product (kept in the Product table
of DB3), or it can represent those who are just in the market to buy said
product (kept in the Want table of DB3). Such a distinction would be made,
for example, in a name-your-own-price website such as priceline.com. During
processing of a query dealing with buyers, the mediator would detect the two
different relationships possible within DB3 and consult the user’s profile to find
a preference. If one is not found (if the user chose “depends” as an answer, for
example) then the user would need to be asked to specify which relationship
s/he wanted. Tuples from the correct table in DB3 can then be retrieved.
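
The profile-then-dialogue order described here can be sketched as follows. The function names and the callback standing in for the dialogue box are hypothetical; only the two Buyer interpretations and their DB3 tables come from the example above.

```python
DEPENDS = "depends"

# The two interpretations of Buyer in DB3 and the table each one reads.
BUYER_SOURCES = {"purchases": "DB3.Product", "in market to buy": "DB3.Want"}


def resolve_interpretation(concept, candidates, profile, ask_user):
    """Use the profile if it names a candidate, otherwise fall back to dialogue."""
    choice = profile.get(concept, DEPENDS)
    if choice in candidates:
        return choice
    # "depends" (or no entry at all): ask at run-time, i.e. on-the-fly dialogue.
    return ask_user(concept, sorted(candidates))


def scripted_dialogue(concept, options):
    """Stand-in for a dialogue box: always picks the first option offered."""
    return options[0]


fixed = resolve_interpretation("Buyer", BUYER_SOURCES,
                               {"Buyer": "purchases"}, scripted_dialogue)
asked = resolve_interpretation("Buyer", BUYER_SOURCES,
                               {"Buyer": DEPENDS}, scripted_dialogue)
print(fixed, "->", BUYER_SOURCES[fixed])  # purchases -> DB3.Product
print(asked, "->", BUYER_SOURCES[asked])  # in market to buy -> DB3.Want
```

Passing the dialogue in as a callback keeps the resolution logic independent of any particular user interface, which seems a natural fit for a system that is meant to serve applications as well as interactive users.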

DB1: Dan’s Goods & Services

    Transaction(PURCHASE#, date, buyer, product_code, quantity)
    Product(PRODUCT_CODE, product_name, description, amt_in_stock, unit_price)

DB2: Al’s Auctions

    Product(PRODUCT_ID, description, opening_bid)
    Auction(AUCTION_ID, seller_id, product_id, high_bid, buyer_id,
            closing_time, num_bids)
    Seller(SELLER_ID, seller_name, seller_email)
    Buyer(BUYER_ID, buyer_name, buyer_email, current_bid, product_id)

DB3: Sam’s Shopbot Site

    Buyer(SAM_ID#, name, email)
    Want(SAM_ID#, PRODUCT_NAME, PRICE)
    Product(PROD_ID, SAM_ID#, product_name, seller, actual_price, qty,
            time_to_find)

Fig. 7. Schemas Used in Experimental Ecommerce Network (keys are in CAPS)

The gravest potential pitfall of on-the-fly dialogue is its overuse. Since it
is an interruption during query processing, we wish to eliminate unnecessary
interaction. We have several ways by which our system can deduce a correct
interpretation. First, the profiles of other users of the same local database can
be used as a guide. If most users interpret a term in the same way, the system
could prevent an interruption by assuming the same interpretation. Second, the
result of an on-the-fly dialogue, no matter what the user answers, may end up
being moot. In our phone example, a local database may be eliminated from the
search space depending on which phone type the user specified in the dialogue. If
that local repository has already been eliminated due to other query constraints,
then asking the user for clarification is fruitless. Third, by using the log data
of past queries, it is possible to detect when the result of an on-the-fly dialogue
would only result in an insignificant increase in the number of returned tuples. In
this situation, the interpretation which produced the greater number of answers
could be assumed with insignificant effect on processing time or user time when
sorting through the results. In general, on-the-fly dialogue could be bypassed if
the effect of “guessing wrong” was negligible.
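
One way the three bypass checks might be combined is sketched below. The thresholds (`majority`, `tolerance`), the data layout, and the function name are illustrative assumptions; the paper does not state how "most users" or "insignificant" are quantified.

```python
from collections import Counter


def bypass_dialogue(concept, peer_profiles, affected_dbs, live_dbs,
                    logged_counts, majority=0.5, tolerance=0.05):
    """Return (skip, assumed): whether to skip the dialogue and what to assume.

    peer_profiles : profiles (dicts) of other users of the same local database
    affected_dbs  : databases whose inclusion depends on the user's answer
    live_dbs      : databases still in the search space
    logged_counts : {interpretation: tuples returned in past queries}
    """
    # 1. Most peers who stated a preference interpret the term the same way.
    votes = Counter(p[concept] for p in peer_profiles
                    if p.get(concept, "depends") != "depends")
    if votes:
        choice, count = votes.most_common(1)[0]
        if count / sum(votes.values()) > majority:
            return True, choice

    # 2. The answer is moot: every affected database is already eliminated.
    if not set(affected_dbs) & set(live_dbs):
        return True, None

    # 3. Past logs show a negligible difference in result size, so assume
    #    the interpretation that returned more tuples.
    if logged_counts:
        largest = max(logged_counts, key=logged_counts.get)
        gap = logged_counts[largest] - min(logged_counts.values())
        if gap <= tolerance * max(logged_counts[largest], 1):
            return True, largest

    return False, None


peers = [{"phone": "regular"}, {"phone": "regular"}, {"phone": "cellular"}]
print(bypass_dialogue("phone", peers, ["DB1"], ["DB1", "DB2"], {}))
# (True, 'regular')
```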


4  Preliminary  Results  and  Future  Work 

We have already implemented our profile generators, DBA toolkit for making
generators, and our annotated mediator. We have also completed preliminary
experiments using concepts and queries from an ecommerce domain. In these
experiments, both novice and expert computer users issued queries to a set of
three databases that each performed on-line buying and selling in a different way.
The first sold goods and services at fixed prices, the second auctioned off goods
to the highest bidder, and the third was a shopbot-based database that worked
in a similar fashion to priceline.com. Their schemas are shown in Figure 7.

Table 1. Results from Experiment

    Query     User Level   Number of   Number of    Probability of
                           Matches     Mismatches   Independence
    Query 1   novice        3          23           .0008
              expert        0           7
    Query 2   novice       12          14           .7001
              expert        3           4
    Query 3   novice       15          11           .3296
              expert        5           2
    Query 4   novice        9          17           .2538
              expert        4           3
    Query 5   novice       16          10           .1154
              expert        6           1

Users  were  first  asked  to  create  a  profile  which  recorded  their  own  opinions 
concerning  common  terms  used  in  this  domain.  For  example,  a  user  was  asked 
to  define  the  term  “Buyer”  given  the  two  interpretations  presented  previously 
in  Section  3.3  and  shown  in  Figure  6.  Then,  each  user  was  shown  the  same 
set  of  test  queries  and  asked  to  profess  what  they  considered  to  be  “correct” 
answers  based  on  their  profiles  and  responses  to  on-the-fly  dialogue.  The  users 
then  compared  the  answers  returned  by  the  system  to  the  answers  they  thought 
would  be  “correct”  to  see  if  the  system  could  properly  discern  their  interpretation 
of  the  query.  Chi-Square  analysis  was  performed  on  the  raw  data  to  test  for  a 
relationship  between  the  two  user  groups.  The  results,  displayed  in  Table  1,  show 
the  number  of  users  whose  query  interpretations  matched  those  of  the  system 
alongside  those  whose  interpretations  did  not  match. 

Several  conclusions  can  be  drawn  from  the  data.  The  low  number  of  matches 
for  the  first  query  is  not  surprising  since  the  idea  of  multiple  query  interpretations 
is  a  new  one  to  most  users.  This  query  represents  a  training  period  for  the  user. 
As  time  went  on,  the  user’s  interpretation  of  the  query  was  correctly  reflected  by 
the  system’s  interpretation  with  increasing  accuracy.  The  continually-reducing 
probability  of  independence  between  the  two  user  groups  indicates  that  expert 
users  tended  to  become  adept  at  using  the  system  in  a  shorter  amount  of  time 
than  novices.  Because  non-experts  are  the  target  audience  for  a  system  of  this 
type,  we  hope  in  the  future  to  adjust  the  wording  of  questions  in  the  profile  generator
and  to  provide  a  more  robust  training  period  using  more  sample  queries.
It  is  hoped  that  these  will  provide  a  ramp  to  proficiency  that  is  neither  too  steep 
nor  too  long  for  either  type  of  user.


5  Conclusion

We  believe  that  the  discovery  of  intent  in  queries  is  an  important  aspect  in 
determining  if  returned  answers  are  indeed  “correct”  or  not.  If  a  system  returns 
tuples  based  on  its  own  static  interpretation  of  query  terms,  regardless  of  how 
broadly  encompassing  those  terms  may  be,  the  lack  of  consideration  for  the  user’s 
expectations  will  always  leave  room  for  misinterpretation. 

We  have  developed  and  implemented  a  system  that  attempts  to  take  the 
user’s  world  assumptions  into  account  by  using  profiles  and  on-the-fly  dialogue 
to  guide  the  query  translation  process.  Using  preferences  specified  both  before 
and  during  run-time,  a  single  query  can  return  different  sets  of  answers  that 
correspond  to  what  different  users  had  in  mind.  Preliminary  empirical  evidence 
is  favorable,  though  we  hope  to  do  more  experiments  to  increase  proficiency  and 
decrease  training  time.


Abstract. This paper presents techniques for discovering and matching
rules with elastic patterns. Elastic patterns are ordered lists of elements
that can be stretched along the time axis. Elastic patterns are useful for
discovering rules from data sequences with different sampling rates. For
fast discovery of rules whose heads (left-hand sides) and bodies (right-hand
sides) are elastic patterns, we construct a trimmed suffix tree from
succinct forms of data sequences and keep the tree as a compact representation
of rules. The trimmed suffix tree is also used as an index structure
for finding rules matched to a target head sequence. When matched rules
cannot be found, the concept of rule relaxation is introduced. Using
a cluster hierarchy and relaxation error as a new distance function, we
find the least relaxed rules that provide the most specific information on
a target head sequence. Experiments on synthetic data sequences reveal
the effectiveness of our proposed approach.


1  Introduction 

Rule  discovery  from  sequential  data  is  a  data  mining  technique  for  trend  prediction
[3]  [7].  There  have  been  several  approaches  [1]  [6]  [9]  [14]  to  discover  useful
rules  from  patterns  occurring  frequently  in  data  sequences.  A  pattern  is  defined 
as  a  partially  ordered  collection  of  elements.  According  to  the  constraints  on 
the  arrangement  of  elements,  patterns  can  be  classified  as  serial  patterns  and 
parallel  patterns  [9]. 

As a subset of serial patterns, we can think of elastic patterns where elements
can be stretched along the time axis by replicating themselves. Elastic patterns
AB and ABC are interpreted as A+B+ and A+B+C+, respectively, using the
notation of a regular expression. (A, B) and (A, A, B, B, B) are instances of an
elastic pattern AB while (A, C, B) is not. Elastic patterns are useful for
discovering rules from data sequences whose sampling rates may vary. For example,
consider medical data sequences that record the body temperatures of patients.
Some data sequences may have temperature values taken every day while
others may have values taken every week. Furthermore, even within a single
data sequence, time intervals between neighboring temperature values can vary
non-linearly. These sequences cannot be compared directly without considering
stretches or compressions of elements along the time axis.
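
The regular-expression reading of elastic patterns can be checked directly. The sketch below is only an illustration and assumes every element is a single-character symbol; the helper names are hypothetical.

```python
import re


def elastic_regex(pattern):
    """Compile an elastic pattern such as "AB" into the regex A+B+."""
    return re.compile("".join(re.escape(sym) + "+" for sym in pattern))


def is_instance(sequence, pattern):
    """True if the whole sequence is an instance of the elastic pattern."""
    return elastic_regex(pattern).fullmatch("".join(sequence)) is not None


print(is_instance(("A", "B"), "AB"))                 # True
print(is_instance(("A", "A", "B", "B", "B"), "AB"))  # True
print(is_instance(("A", "C", "B"), "AB"))            # False
```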

The rules whose heads (left-hand sides) and bodies (right-hand sides) are
elastic patterns are called elastic rules. Given elastic patterns α and β, elastic
rules have the format ‘α → β’ that is interpreted as “if there occurs a sequence
which is an instance of α, then it will be followed by a sequence which is an
instance of β”. Time intervals are not associated with elastic patterns because
these patterns are flexible on the time axis.

There are many techniques [1] [6] [9] [14] to discover rules with serial patterns.
Many of them use the relationship between patterns and their sub-patterns.
Given the serial patterns AB occurring 200 times and ABAC occurring 150
times, they extract the rule ‘AB → AC with confidence 150/200 (= 0.75)’. Infrequent
patterns, whose numbers of occurrences are below a threshold, are ignored because
they are considered insignificant. To find frequent patterns, they
first find short frequent patterns and then combine them to generate longer
candidate patterns. Each candidate pattern is then checked to determine whether
it is frequent. These combining and checking steps are repeated until all frequent patterns
are found. Therefore, repeated readings of data sequences are unavoidable.
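
The confidence figure in this example is simply the ratio of the two occurrence counts. A minimal sketch, assuming the counts are already available in a dictionary keyed by pattern string (the function name is hypothetical):

```python
def rule_confidence(counts, head, body):
    """Confidence of head -> body: occurrences of head+body over occurrences of head."""
    return counts[head + body] / counts[head]


counts = {"AB": 200, "ABAC": 150}  # occurrence counts from the example above
print(rule_confidence(counts, "AB", "AC"))  # 0.75
```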

Once  rules  are  discovered  from  data  sequences,  they  may  be  used  to  predict 
the  future  trend  of  a  target  head  sequence  q  via  the  process  of  rule  matching. 
We  say  that  a  rule  is  matched  to  q  when  each  element  of  the  rule  head  is  equal 
to  the  corresponding  element  of  q.  However,  if  there  is  a  large  number  of  rules,
it  is  not  a  trivial  task  to  efficiently  find  the  rules  that  are  matched  to  q.

There  are  some  occasions  when  we  fail  to  find  rules  matched  to  a  target 
head  sequence  q.  This  failure  often  occurs  when  q  is  not  a  frequent  sequence. 
For  those  infrequent  target  head  sequences,  we  can  introduce  the  concept  of  rule 
relaxation.  Based  on  a  cluster  hierarchy,  a  rule  R  is  relaxed  to  R'  by  replacing 
some  elements  of  R  with  elements  denoting  more  general  concepts  or  broader 
range  values.  Given  a  target  head  sequence  q  and  a  rule  R  that  is  not  matched  to 
q,  we  can  relax  R  to  R'  so  that  R'  covers  q.  We  say  that  a  rule  covers  q  when  each 
element  of  the  rule  head  represents  the  same  range  as  or  broader  range  than  the 
one  represented  by  the  corresponding  element  of  q.  Among  many  relaxed  rules 
that  can  cover  q,  we  are  interested  in  finding  the  least  relaxed  rules  since  they
describe  q  more  accurately  than  the  other  relaxed  rules. 
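
The cover test and a naive form of relaxation can be sketched as below. The hierarchy is hypothetical: the symbols, value ranges, and parent links are assumptions chosen for illustration, not the paper's TAH. The sketch lifts each head element only as far as needed to cover the corresponding element of q; it does not use relaxation error to rank alternative relaxations, and it assumes the rule head and q have the same length.

```python
# Hypothetical cluster hierarchy: value range and parent of each node.
RANGE = {"A": (0.0, 2.0), "B": (2.0, 3.0), "C": (3.0, 4.0), "D": (4.0, 7.0),
         "E": (0.0, 3.0), "F": (3.0, 7.0), "G": (0.0, 7.0)}
PARENT = {"A": "E", "B": "E", "C": "F", "D": "F", "E": "G", "F": "G"}


def covers_symbol(general, specific):
    """True if general's range is the same as or broader than specific's."""
    g_lo, g_hi = RANGE[general]
    s_lo, s_hi = RANGE[specific]
    return g_lo <= s_lo and s_hi <= g_hi


def covers(rule_head, q):
    """A rule covers q when every head element covers q's matching element."""
    return len(rule_head) == len(q) and all(
        covers_symbol(r, s) for r, s in zip(rule_head, q))


def relax_to_cover(rule_head, q):
    """Lift each head element up the hierarchy just far enough to cover q."""
    relaxed = []
    for r, s in zip(rule_head, q):
        while not covers_symbol(r, s):
            r = PARENT[r]  # the root covers every range, so this terminates
        relaxed.append(r)
    return relaxed


head, q = ["B", "C"], ["A", "C"]
print(covers(head, q))          # False
print(relax_to_cover(head, q))  # ['E', 'C']
```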

In this paper, we investigate the problems of discovering and matching elastic
rules for data sequences with different sampling rates. An efficient rule discovery
algorithm is developed, and algorithms for exact and relaxed rule matching
are presented.


2  Background 

2.1  Suffix  Tree 

A suffix tree [13] is an index structure that has been used as a fast access method
to locate substrings (or subsequences) that are exactly matched to a target string
(or a target sequence). The suffix tree structure is based on tries and suffix tries.
A trie is an indexing structure used for indexing sets of keywords of varying sizes.
A suffix trie is a trie whose set of keywords comprises the suffixes of sequences.
Nodes of a suffix trie with a single outgoing edge can be collapsed, yielding a
suffix tree. We use the notation P_N for the parent node of N, and the notation
label(N_i, N_j) for the labels on the path connecting nodes N_i and N_j.

2.2  Type  Abstraction  Hierarchy 

Type Abstraction Hierarchy (TAH) [5] is a data-driven multi-level cluster
hierarchy that uses relaxation error as a goodness measure for generating clusters.
For a cluster C = {x_1, x_2, ..., x_n} of n elements, the relaxation error for C is
defined as

    RE(C) = \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i) P(x_j) |x_i - x_j|

where P(x_i) and P(x_j) are occurring probabilities of x_i and x_j, respectively.
The algorithms for generating
binary and n-ary TAHs are given in [5]. TAH is easier to implement than the
maximum-entropy clustering method and generates more accurate clusters than
the equal-length interval clustering method. Figure 1 shows an example TAH
built from a distribution of data sequences whose elements take values within
the range of [0, 7.0). The relaxation error and the value range are stored in each
node, and the nodes are labeled with unique symbols.
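
For concreteness, the relaxation error of a cluster (the double sum of P(x_i) P(x_j) |x_i - x_j| over all pairs of elements) can be computed as in the sketch below. It assumes the occurring probabilities are estimated as relative frequencies within the cluster, which is an assumption of this example rather than something stated here.

```python
from collections import Counter


def relaxation_error(values):
    """Sum of P(x_i) * P(x_j) * |x_i - x_j| over all pairs of distinct values.

    Probabilities are estimated as relative frequencies in `values`.
    """
    n = len(values)
    prob = {x: count / n for x, count in Counter(values).items()}
    return sum(p_i * p_j * abs(x_i - x_j)
               for x_i, p_i in prob.items()
               for x_j, p_j in prob.items())


print(relaxation_error([1.0, 1.0]))  # 0.0
print(relaxation_error([0.0, 2.0]))  # 1.0
```

A cluster whose elements all share one value has zero relaxation error, and the error grows as the values spread apart, which is why it can serve as a goodness measure for a cluster.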




Fig.  1.  An  example  of  TAH.  Each  node  is  labeled  with  a  unique  symbol.  The  value 
range  and  the  corresponding  relaxation  error  are  stored  at  each  node. 


3  Rule  Discovery 

In  this  section,  we  propose  an  efficient  method  to  discover  elastic  rules  from  data 
sequences  via  a  suffix  tree.  We  assume  that  the  TAH  has  been  generated  from 
data  sequences  and  distinct  symbols  have  been  assigned  to  the  TAH  nodes. 

The support value of the pattern α is defined as the number of suffixes having
α as their prefixes. SUP_min is the minimum support value that is used to filter
out infrequent patterns. We also define the relative support value of the pattern
α as RSUP(α) = (the number of suffixes having α as their prefixes) / (the
total number of suffixes). For the applications where the total number of data
sequences and their lengths may vary, the relative support is better than the
(absolute) support.
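
These two definitions can be checked by brute force on a small input. The sketch below enumerates every suffix explicitly, which is only practical for tiny examples; the function names and the sample sequences are made up for illustration.

```python
def suffixes(seq):
    """All non-empty suffixes of a sequence, as tuples."""
    return [tuple(seq[i:]) for i in range(len(seq))]


def support(pattern, sequences):
    """Number of suffixes (over all sequences) that have `pattern` as a prefix."""
    pattern = tuple(pattern)
    return sum(1 for s in sequences for suf in suffixes(s)
               if suf[:len(pattern)] == pattern)


def relative_support(pattern, sequences):
    """Support divided by the total number of suffixes."""
    total = sum(len(s) for s in sequences)  # one suffix per position
    return support(pattern, sequences) / total


data = [("C", "B", "D"), ("C", "B", "C")]
print(support(("C", "B"), data))           # 2
print(support(("C",), data))               # 3
print(relative_support(("C", "B"), data))  # 2 of 6 suffixes
```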

The problem of elastic rule discovery is defined as follows: Given a database
with M sequences x_1, x_2, ..., x_M and the minimum support value SUP_min,
discover rules composed of elastic patterns whose supports are at least SUP_min. This
elastic rule discovery consists of the following five steps.

Step 1. Converting numeric elements to symbol elements: We convert
each numeric element of data sequences into the symbol of the corresponding
leaf node of the TAH. The symbolized representation of x is denoted
as S(x). For example, based on the TAH in Figure 1, a data sequence
x = (3.4, 3.0, 3.7, 2.3, 2.1, 4.3) is converted to S(x) = (C, C, C, B, B, D).

Step 2. Compaction: We convert the symbolized data sequence S(x) into the compact representation C(S(x)) by replacing consecutive elements that have the same value with a single element of that value. This step accounts for the elastic property of the patterns. For example, S(x) = (C, C, C, B, B, D) is converted to C(S(x)) = (C, B, D). We use the notation X for C(S(x)).
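Steps 1 and 2 can be sketched in a few lines of Python. The leaf boundaries below are hypothetical: the text does not list the value ranges of the TAH in Figure 1, so they were chosen only so that the example sequence maps to the symbols given above. The names `LEAF_BOUNDS`, `LEAF_SYMBOLS`, `symbolize`, and `compact` are illustrative assumptions.

```python
from bisect import bisect_right
from itertools import groupby

# Hypothetical leaf ranges over [0, 7.0); the real boundaries come from the
# clustering step that builds the TAH and are not given in the text.
LEAF_BOUNDS = [0.0, 2.0, 3.0, 4.0, 5.0, 7.0]   # leaf i covers [b[i], b[i+1])
LEAF_SYMBOLS = ["A", "B", "C", "D", "E"]


def symbolize(x):
    """Step 1: map each numeric element to the symbol of its leaf range."""
    symbols = []
    for value in x:
        if not (LEAF_BOUNDS[0] <= value < LEAF_BOUNDS[-1]):
            raise ValueError(f"value {value} lies outside the TAH range")
        symbols.append(LEAF_SYMBOLS[bisect_right(LEAF_BOUNDS, value) - 1])
    return tuple(symbols)


def compact(symbols):
    """Step 2: collapse each run of equal consecutive symbols to one symbol."""
    return tuple(key for key, _ in groupby(symbols))


x = (3.4, 3.0, 3.7, 2.3, 2.1, 4.3)
print(symbolize(x))           # ('C', 'C', 'C', 'B', 'B', 'D')
print(compact(symbolize(x)))  # ('C', 'B', 'D')
```

A sorted boundary list with binary search keeps the lookup logarithmic in the number of leaves; any other interval lookup would serve equally well.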

Step 3. Suffix tree construction: From the set of M converted data sequences Xi, we build a suffix tree using either McCreight's algorithm [8] or an incremental disk-based algorithm [4].

Step 4. Trimming: We compute the support values of the nodes and trim out the nodes whose support values are less than SUPmin. The support values of internal nodes are obtained by summing up the support values of their children nodes. The support values of the leaf nodes are the same as the number of suffixes represented by the leaf nodes. The trimmed suffix tree is called the rule tree.
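To make the support definition and the trimming step concrete, the following sketch uses a naive, uncompressed suffix trie instead of McCreight's algorithm or the disk-based algorithm cited above. It is an illustration under simplifying assumptions only: construction is quadratic in sequence length, supports are counted while inserting suffixes rather than summed bottom-up, and the function names and the dictionary layout are hypothetical.

```python
def build_suffix_trie(sequences):
    """Insert every suffix of every sequence; count suffixes passing each node."""
    root = {"support": 0, "children": {}}
    for seq in sequences:
        for start in range(len(seq)):
            root["support"] += 1          # total number of suffixes
            node = root
            for sym in seq[start:]:
                node = node["children"].setdefault(
                    sym, {"support": 0, "children": {}})
                node["support"] += 1      # this suffix has the path as prefix
    return root


def support(root, pattern):
    """Number of suffixes having `pattern` as a prefix (0 if absent)."""
    node = root
    for sym in pattern:
        node = node["children"].get(sym)
        if node is None:
            return 0
    return node["support"]


def trim(node, sup_min):
    """Drop every subtree whose support is below sup_min (Step 4)."""
    node["children"] = {
        sym: trim(child, sup_min)
        for sym, child in node["children"].items()
        if child["support"] >= sup_min
    }
    return node


compacted = [("C", "B", "D"), ("C", "B", "A"), ("B", "D")]
trie = build_suffix_trie(compacted)
print(support(trie, ("B",)))                    # 3
print(support(trie, ("B",)) / trie["support"])  # relative support: 0.375
rule_tree = trim(trie, sup_min=2)
print(support(rule_tree, ("C", "B", "D")))      # 0, trimmed away
```

Because a child's support never exceeds its parent's, removing a node below the threshold removes its whole subtree without losing any frequent pattern.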

Step 5. Rule extraction: We compute the confidence values of the nodes and then extract rules. The confidence value of a node N is confidence(N) = Support(N) / Support(PN), where PN is the parent node of N. If the number of labels on the path from PN to N is L, we extract the following L rules:

R1 : label(rootNode, PN) → label(PN, N)

R2 : label(rootNode, PN) • (label(PN, N)[1 : 1]) → label(PN, N)[2 : L]

R3 : label(rootNode, PN) • (label(PN, N)[1 : 2]) → label(PN, N)[3 : L]

...

RL : label(rootNode, PN) • (label(PN, N)[1 : L − 1]) → label(PN, N)[L : L]

where label(PN, N)[i : j] is the subsequence of label(PN, N) including elements in positions i through j (i ≤ j ≤ L), and ‘•’ is the binary operator for concatenating two sequences. If PN is the root node, then label(rootNode, PN) becomes the empty sequence (). The confidence of R1 is the same as confidence(N), while the confidences of R2, ..., RL are 1. Figure 2 shows a part of a rule tree and the rules extracted from the tree. The values in the nodes represent their support values.


Fig. 2. An example of a rule tree and the corresponding rules. (a) A part of a rule tree. (b) Rules extracted from a rule tree.
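The rule extraction of Step 5 can be sketched for a single edge (PN, N) as follows. The function name `extract_rules` and its argument layout are assumptions made for illustration; each rule is returned as a (head, body, confidence) triple.

```python
def extract_rules(path_to_parent, edge_label, support_n, support_pn):
    """Return the L rules for the edge (PN, N) as (head, body, confidence).

    path_to_parent -- label(rootNode, PN), possibly empty
    edge_label     -- label(PN, N), of length L >= 1
    """
    path_to_parent = tuple(path_to_parent)
    edge_label = tuple(edge_label)
    # R1: the whole edge label is the body; confidence(N) = sup(N) / sup(PN).
    rules = [(path_to_parent, edge_label, support_n / support_pn)]
    # R2..RL: move the first k edge symbols into the head. Inside an edge
    # there is no branching, so these rules have confidence 1.
    for k in range(1, len(edge_label)):
        rules.append((path_to_parent + edge_label[:k], edge_label[k:], 1.0))
    return rules


for head, body, conf in extract_rules(("C",), ("B", "D", "A"), 2, 4):
    print(head, "->", body, "confidence", conf)
```

For an edge label of length 3 this yields exactly three rules, the first with confidence 0.5 and the other two with confidence 1.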


4  Exact  Rule  Matching 

Exact  rule  matching  is  defined  as  follows:  Given  a  rule  tree,  a  type  abstraction 
hierarchy,  and  a  target  head  sequence  q,  find  the  rules  matched  to  q.  Our  approach 
for  exact  rule  matching  consists  of  two  steps. 

Step 1. Search for exactly matched rule head: Using the rule tree as an index structure, we find the rule head h that exactly matches a target head sequence. Algorithm 1 shows the exact matching algorithm RTI-E (Rule Tree Index for Exact matching). We use the notation Q for C(S(q)). The algorithm traverses the rule tree and returns a pair (N, p) that represents the matched rule head h = label(rootNode, PN) • (label(PN, N)[1 : p]). The first call to the algorithm has two arguments: rootNode and Q.

Step 2. Rule extraction from exactly matched rule head: Using the relationship between the exactly matched rule head and its following subsequences, we extract the rules. Let us assume that RTI-E has returned the pair (N, p) and the length of label(PN, N) is L. If p < L, then the matched rule is ‘label(rootNode, PN) • (label(PN, N)[1 : p]) → label(PN, N)[p + 1 : L] with confidence 1’. Otherwise, the number of matched rules is the same as the number of children of N. For each child node CN of N, the matched rule is ‘label(rootNode, N) → label(N, CN) with confidence(CN)’.

Input : node N, target head sequence Q
Output: child node CN, length of matched prefix

Visit the node N;
Select the child node, CN, where label(N, CN) is matched to the prefix of Q;
Remove the matched prefix from Q;
if Q becomes empty then return a pair (CN, the length of the matched prefix);
else call RTI-E(CN, Q);

Algorithm 1: Exact matching algorithm RTI-E
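A minimal Python reading of Algorithm 1 is sketched below. The `Node` class and the convention of returning `None` when no rule head matches are assumptions, since the pseudocode does not specify the tree representation or the failure case. Edge labels are tuples of symbols, and Q must also be a tuple.

```python
class Node:
    """Rule-tree node; children maps an edge label (tuple) to a child Node."""

    def __init__(self, support=0):
        self.support = support
        self.children = {}


def rti_e(node, Q):
    """Return (CN, p): the node reached and the number of symbols matched on
    its incoming edge, or None if no rule head matches Q exactly."""
    for label, child in node.children.items():
        n = min(len(label), len(Q))
        if n > 0 and label[:n] == Q[:n]:
            if len(Q) <= len(label):
                return child, len(Q)           # Q is used up inside this edge
            return rti_e(child, Q[len(label):])  # consume the edge, descend
    return None


# A small hypothetical rule tree: root -CB-> n1, n1 -D-> n2, n1 -AE-> n3.
root, n1, n2, n3 = Node(8), Node(2), Node(1), Node(1)
root.children[("C", "B")] = n1
n1.children[("D",)] = n2
n1.children[("A", "E")] = n3

node, p = rti_e(root, ("C", "B", "A"))
print(node is n3, p)   # True 1
```

Here the returned p = 1 is smaller than the edge length L = 2, so Step 2 would produce the single rule (C, B, A) → (E) with confidence 1.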


5  Relaxed  Rule  Matching 


Relaxed  rule  matching  is  defined  as  follows:  Given  a  rule  tree,  a  type  abstraction 
hierarchy,  and  a  target  head  sequence  q,  find  the  least  relaxed  rules  that  cover  q. 

Since a rule head h whose length is not equal to |q| may be stretched and relaxed to cover q, we propose a relaxation-error based time warping distance function Dre(h, q) as a similarity measure of h and q. Dre(h, q) stretches h and q non-linearly to find the best element mappings that minimize the difference between h and q. Let h[i] be mapped to q[j]. Then, the distance of this mapping is defined as RE(h[i], q[j]) − RE(h[i]), where RE(h[i], q[j]) is the relaxation error of the lowest node containing both h[i] and q[j], and RE(h[i]) is the relaxation error of the lowest node containing h[i]. The detailed description of Dre(h, q) is given in [10].

Dre(h, q) can be calculated efficiently by the dynamic programming technique [2] based on the recurrence relation r(x, y) (x = 1, ..., |h|, y = 1, ..., |q|). The final cumulative distance, r(|h|, |q|), is the amount of relaxation needed for h to cover q. Using the cluster hierarchy (Figure 1), Figure 3 shows the cumulative distance table T for Dre(h = (C, A, E, D, A), q = (C, E, A)) and the element mappings after time warping and relaxation. In the following, we present the two-step approach for relaxed rule matching.
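The following sketch shows one way the cumulative distance could be computed. It assumes the standard time warping recurrence r(x, y) = cost(h[x], q[y]) + min(r(x−1, y), r(x, y−1), r(x−1, y−1)); the exact recurrence is deferred to [10] in the text, so this form is an assumption. The hierarchy `PARENT` and the relaxation errors `RE` are invented for illustration and are not the values of Figure 1.

```python
import math

# Hypothetical hierarchy: leaves A..E, internal nodes N1, N2, and ROOT.
PARENT = {"A": "N1", "B": "N1", "C": "N2", "D": "N2", "E": "N2",
          "N1": "ROOT", "N2": "ROOT"}
# Hypothetical relaxation errors, growing towards the root.
RE = {"A": 0.2, "B": 0.2, "C": 0.2, "D": 0.2, "E": 0.2,
      "N1": 0.5, "N2": 0.8, "ROOT": 1.5}


def ancestors(sym):
    """The node itself followed by its ancestors up to the root."""
    chain = [sym]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def mapping_cost(h_sym, q_sym):
    """RE(lowest node containing both symbols) - RE(h_sym)."""
    q_chain = set(ancestors(q_sym))
    common = next(n for n in ancestors(h_sym) if n in q_chain)
    return RE[common] - RE[h_sym]


def d_re(h, q):
    """Cumulative relaxation needed for h to cover q (assumed recurrence)."""
    r = [[math.inf] * (len(q) + 1) for _ in range(len(h) + 1)]
    r[0][0] = 0.0
    for x in range(1, len(h) + 1):
        for y in range(1, len(q) + 1):
            best = min(r[x - 1][y], r[x][y - 1], r[x - 1][y - 1])
            r[x][y] = mapping_cost(h[x - 1], q[y - 1]) + best
    return r[len(h)][len(q)]


print(d_re(("C", "C", "E"), ("C", "E")))  # 0.0: stretching alone suffices
print(d_re(("C", "D"), ("C", "E")))       # about 0.6: D is relaxed to N2
```

Identical symbols cost nothing, so pure stretching is free, while a mismatch costs the increase in relaxation error incurred by moving up the hierarchy.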

Step  1.  Search  for  the  nearest  rule  head:  To  generate  the  least  relaxed 
rules,  we  first  traverse  the  rule  tree  to  find  the  rule  head  h  that  requires  the  least 
relaxation  to  cover  a  target  head  sequence  q.  The  similarity  matching  algorithm 
RTI-S  (Rule  Tree  Index  for  Similarity  matching)  is  given  in  Algorithm  2.  Note 
that a target head sequence q is converted to the compact representation Q before beginning the search process. The algorithm maintains three global variables during its execution: Q, the nearest rule head h found so far, and its distance MinDist from Q. The first call to the algorithm has two arguments: rootNode and emptyTable. RTI-S reduces the search time by applying the branch-pruning approach [11] and by allowing the cumulative distance table to be shared by rule heads that have the same prefix.

Fig. 3. Cumulative distance table T for Dre(h = (C, A, E, D, A), q = (C, E, A)) and the mapping of elements after time warping and relaxation. (a) Cumulative distance table T for Dre((C, A, E, D, A), (C, E, A)). (b) Mapping of elements after time warping. The first and the third element of the query sequence are replicated. (c) Mapping of elements after relaxation. The second element (=A) and the fourth element (=D) of the head sequence are relaxed to X and Y, respectively.


Input : node N, cumulative distance table T

Visit the node N;
for each child node CN do
    Build a new cumulative distance table newT by adding rows corresponding to label(N, CN) on T;
    Find a nearer rule head h from newT and update MinDist;
    If further traversal down the tree is necessary, call RTI-S(CN, newT);

Algorithm 2: Similarity matching algorithm RTI-S
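Algorithm 2 can be sketched as a depth-first traversal that extends the cumulative distance table by one row per head symbol, so that heads sharing a prefix share the rows of that prefix. The pruning test used here (stop descending when the smallest entry of the newest row is not below MinDist) is a simplification that is valid only for non-negative mapping costs; the actual branch-pruning approach of [11] is not detailed in the text. The `Node` class, the `cost` parameter, and the toy 0/1 cost in the usage example are assumptions.

```python
import math


class Node:
    """Rule-tree node; children maps an edge label (tuple) to a child Node."""

    def __init__(self):
        self.children = {}


def extend_row(prev_row, h_sym, Q, cost):
    """Add the table row for one more head symbol h_sym."""
    row = [math.inf] * (len(Q) + 1)
    for y in range(1, len(Q) + 1):
        best = min(prev_row[y], row[y - 1], prev_row[y - 1])
        row[y] = cost(h_sym, Q[y - 1]) + best
    return row


def rti_s(root, Q, cost):
    """Return (nearest rule head, its distance from Q)."""
    best = {"dist": math.inf, "head": None}

    def visit(node, prefix, row):
        for label, child in node.children.items():
            new_row, head, pruned = row, prefix, False
            for sym in label:
                new_row = extend_row(new_row, sym, Q, cost)
                head = head + (sym,)
                if new_row[-1] < best["dist"]:       # nearer head found
                    best["dist"], best["head"] = new_row[-1], head
                if min(new_row) >= best["dist"]:     # no extension can win
                    pruned = True
                    break
            if not pruned:
                visit(child, head, new_row)

    visit(root, (), [0.0] + [math.inf] * len(Q))
    return best["head"], best["dist"]


# Hypothetical tree: root -C-> n1, n1 -EA-> n2, n1 -B-> n3, root -B-> n4.
root, n1 = Node(), Node()
root.children[("C",)] = n1
n1.children[("E", "A")] = Node()
n1.children[("B",)] = Node()
root.children[("B",)] = Node()

toy_cost = lambda a, b: 0.0 if a == b else 1.0   # stand-in for the RE cost
print(rti_s(root, ("C", "E"), toy_cost))          # (('C', 'E'), 0.0)
```

Only the last row of the table is kept per recursion level, which is enough to compute distances; the full table would be needed to recover the element mappings used in Step 2.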


Step  2.  Rule  extraction  from  the  nearest  rule  head:  After  finding  the 
rule  head  h  most  similar  to  Q,  we  generate  the  least  relaxed  rules  from  h  and  its 
following  subsequences.  This  step  begins  with  extracting  the  rules  from  h  using 
the  method  explained  in  Section  4.  Then,  we  convert  symbols  of  rule  heads  and 
bodies  into  their  relaxed  symbols  according  to  the  mapping  of  h  and  Q,  and  get 
the  compact  representations  of  rules.  Finally,  the  rules  having  the  same  head 
and  body  are  merged  and  their  confidence  values  are  recomputed. 

6  Experiments 

To study the effectiveness of our proposed methods, we performed experiments on synthetic random-walk data sequences [10]. We used the relative minimum support value RSUPmin to control the number of discovered rules.

6.1  Rule  Discovery 

We  used  the  total  elapsed  time  as  a  performance  measure  of  our  rule  discovery 
algorithm.  First,  we  increased  the  number  of  sequences  from  100  to  10,000  while 
keeping their average length constant at 200. Then, we changed the average length of sequences from 100 to 1,000 while maintaining the number of sequences at 500. As shown in Figures 4 and 5, the total elapsed times increase linearly as the number and the average length of data sequences grow. The figures also show that the linearity is maintained with different RSUPmin values.


Fig. 4. Total elapsed time for discovering elastic rules with selected numbers of data sequences (x-axis: number of data sequences).

Fig. 5. Total elapsed time for discovering elastic rules with selected average lengths of data sequences (x-axis: average length of data sequences).

Fig. 6. Performance comparison between sequential scanning and RTI-E for exact rule matching (x-axis: number of rules).

Fig. 7. Performance comparison between sequential scanning and RTI-S for relaxed rule matching (x-axis: number of rules).


6.2  Rule  Matching 

For 500 data sequences with an average length of 400, Figure 6 shows the average search times of RTI-E and the SS (Sequential Scanning)-based exact matching algorithm for increasing numbers of rules. The search times of the SS-based exact matching algorithm increase linearly with the number of rules, while the search times of RTI-E remain relatively constant. Figure 7 shows the average search times of RTI-S and the SS-based similarity matching algorithm. The performance gain of RTI-S increases as the number of rules increases.


7 Conclusion

In this paper, we proposed a method to discover elastic rules from sequence databases. We also presented efficient techniques to find matched rules and to derive the least relaxed rules. We focused on data sequences consisting of univariate numeric values. If elements are non-numeric, we employ an encoding scheme that converts non-numeric elements to numeric elements.

Experiments on synthetic data sequences revealed that: 1) the running time of our rule discovery algorithm is linear in both the total number and the average length of data sequences, and 2) our exact and relaxed rule matching algorithms are several orders of magnitude faster than sequential scanning.

References 

1. R. Agrawal and R. Srikant, “Mining Sequential Patterns”, Proc. IEEE ICDE, 1995.

2. D. J. Berndt and J. Clifford, “Finding Patterns in Time Series: A Dynamic Programming Approach”, Advances in Knowledge Discovery and Data Mining, AAAI/MIT, 1996.

3. P. S. Bradley, U. M. Fayyad, and O. L. Mangasarian, “Data Mining: Overview and Optimization Opportunities”, Microsoft Research Report MSR-TR-98-04, 1998.

4. P. Bieganski, J. Riedl, and J. V. Carlis, “Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation”, Proc. Hawaii Int’l Conf. on System Sciences, 1994.

5. W. W. Chu and K. Chiang, “Abstraction of High Level Concepts from Numerical Values in Databases”, Proc. of AAAI Workshop on Knowledge Discovery in Databases, 1994.

6. G. Das, K. Lin, H. Mannila, G. Renganathan, and P. Smyth, “Rule Discovery from Time Series”, Proc. International Conference on Knowledge Discovery and Data Mining, 1998.

7. U. M. Fayyad, “Mining Databases: Toward Algorithms for Knowledge Discovery”, Data Engineering Bulletin 21(1), 1998.

8. E. M. McCreight, “A Space-Economical Suffix Tree Construction Algorithm”, Journal of the ACM, Vol. 23, No. 2, 1976.

9. H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovering Frequent Episodes in Sequences”, Proc. International Conference on Knowledge Discovery and Data Mining, 1995.

10. S. Park and W. W. Chu, “Discovering and Matching Elastic Rules from Sequence Databases”, UCLA Technical Report UCLA-CS-TR-200012, 2000.

11. S. Park, W. W. Chu, J. Yoon, and C. Hsu, “Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases”, Proc. IEEE ICDE, 2000.

12. L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

13. G. A. Stephen, String Searching Algorithms, World Scientific Publishing, 1994.

14. J. T.-L. Wang, G.-W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang, “Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results”, Proc. ACM SIGMOD, 1994.


Perception-Based  Granularity  Levels  in  Concept 
Representation 


Lorenza Saitta1 and Jean-Daniel Zucker2

1 Univ. del Piemonte Orientale, Dip. di Scienze e Tecnologie Avanzate
Corso Borsalino 54, 15100 Alessandria (Italy)
saitta@di.unito.it

2 Université Paris VI - CNRS, Laboratoire d’Informatique de Paris 6
4, Place Jussieu, F-75252 Paris (France)
Jean-Daniel.Zucker@lip6.fr

Abstract.  In  this  paper  we  propose  a  perception-based  view  of  abstraction,  which 
originates  from  the  observation  that  conceptualization  of  a  domain  involves  entities 
belonging  to  several  epistemological  levels.  The  fundamental  level  corresponds  to 
the  perception  of  a  world.  For  memorization  purposes,  some  kind  of  structure  is 
needed,  in  order  to  organize  objects  and  relations  perceived  in  the  world  into 
coherent  ensembles.  To  communicate  with  others,  a  language  must  be  invented, 
and,  finally,  a  theory  makes  it  possible  to  reason  about  the  world.  After  discussing 
suitable  properties  abstraction  should  have  to  be  useful  for  concept  representation, 
examples  of  abstraction  operators,  designed  to  perform  the  abstraction  process  in 
practice,  will  be  introduced. 

1  Introduction 

Abstraction, intended as the ability to forget irrelevant details and to find simpler descriptions, has been mainly investigated in problem solving [16, 20, 5, 3, 8] and in problem reformulation [12, 1, 18, 24]. In this paper we are interested, instead, in the role played by abstraction in a phase preceding problem solving, namely the phase of conceptualizing a domain, when a set of appropriate, possibly interrelated concepts is defined.

In  problem  solving  and  problem  reformulation,  abstraction  consists  of  a 
transformation  of  the  representation  language  that  allows  a  theorem  to  be  proved  (or  a 
problem  to  be  solved)  more  easily,  i.e.,  with  a  reduced  computational  effort.  This 
pragmatic  view  of  abstraction,  which  proved  very  useful  to  its  intended  goal,  may  not 
be  sufficient  for  concept  definition,  where  computational  issues,  even  though 
important,  are  subsequent  to  the  establishing  of  meaningful  relations  between  the 
“concepts” and their referents in the world. In concept representation, in fact, the role of abstraction seems more related to “making sense” of the perception of the world, by transforming it into a set of meaningful “concepts”, than to making efficient use of them. Abstraction is thus a fundamental mechanism for saving cognitive effort, by offering us a “higher” level view of our physical and intellectual environment. Goldstone and Barsalou [6] have recently advocated a stricter link between perception and conceptualization in Cognitive Science. We think that their approach offers a cognitive foundation for our model of abstraction.


2 Related Work

Plaisted [16] has provided foundations of theorem proving with abstraction, seen as a function from a set of clauses to another one that satisfies some properties related to the resolution principle. Tenenberg [20] starts from Plaisted's work to define an abstraction as a mapping between predicates that preserves logical consistency. He defines an abstraction either as a predicate mapping or as a mapping between clauses based on predicate mappings, where only consistent clauses are considered.

Giunchiglia  and  Walsh  [5]  reviewed  most  of  the  work  done  in  reasoning  with 
abstraction.  Extending  Tenenberg's  work,  Nayak  and  Levy  [14]  proposed  a  semantic 
theory  of  abstraction.  This  theory  defines  abstraction  as  a  model  level  mapping  rather 
than  predicate  mapping.  This  semantic  theory  yields  model  increasing  abstractions 
that  are  weaker  than  the  base  theory,  i.e.,  they  are  strictly  a  subset  of  the  "theorem 
decreasing"  abstractions  introduced  in  [5]. 

Abstraction has also been studied in relation to change of problem representation [2, 1, 18, 19, 11, 12]. The notion of granularity is related to the analysis of possible links between levels of abstraction [7, 9, 15].


3  Definition  of  Perception-Based  Abstraction 

The  novel  perspective  on  abstraction  that  we  propose  originates  from  the  observation 
that  the  conceptualization  of  a  domain  involves  at  least  four  different  levels. 
Underlying any source of experience there is the world W. We consider a fixed part of the world, and we assume further that it does not change over time. The world is not really known, because we only have a mediated access to it through our perception. An actual world perception P is obtained through a process $\mathcal{P}$ of signal/information acquisition from/about the world:

$P = \mathcal{P}(W)$

Even  though  the  primary  source  of  information  is  the  flow  of  sensory  perceptions 
from  the  world,  we  cannot  go  back  to  them  every  time  we  approach  a  new  task.  Then, 
when  we  consider  a  world,  we  can  think  of  using  a  perception  system  P  more 
complex  than  the  one  sufficient  to  capture  basic  signals;  P  detects  objects,  properties 
and  relations  specified  by  P  through  mechanisms  that  we  leave  implicit.  More 
precisely,  in  the  world,  both  atomic  and  compound  objects  can  be  perceived.  Atomic 
objects  do  not  have  parts,  whereas  compound  objects  do  have  parts  that  are 
themselves  objects:  a  part-of  hierarchy  relates  compound  objects  to  their 
constituents.  Single  objects  (both  atomic  and  compound  ones)  have  properties,  which 
we call attributes. Other types of properties involve groups of objects, resulting in functions and relations. The percepts in P, provided by $\mathcal{P}(W)$, can be grouped into four classes:

P = <OBJ, ATT, FUNC, REL>.

OBJ is a set of objects, ATT is a set of object attributes, FUNC is a set of functional links, and REL is a set of relations.

At the perception level, the percepts “exist” only for the observer, and only during the act of perceiving. In order to let the stimuli become available over time, for retrieval and further reasoning, they must be memorized and organized into a structure S [22]. This structure is an extensional representation of the perceived world, in which stimuli perceptually related to one another are stored together. In an artificial system, storage occurs in a relational database, manipulated via relational algebra operators [21]. We will denote by $\mathcal{M}$ the memorization process:

$S = \mathcal{M}(P)$

Next, in order to describe the perceived world symbolically, and to communicate with other agents, a language L is needed. L allows the perceived world to be described intensionally. Assigning “names” to the tables in the structure S and to their entries is a process of description, denoted by $\mathcal{D}$:

$L = \mathcal{D}(S)$

Finally, a theory T allows reasoning about the world. The theory may also contain general knowledge, which does not belong to any specific domain. At the theory level inference rules are used. We call formalization the process $\mathcal{F}$ of expressing the theory in the language L (possibly enriched to accommodate domain-independent background knowledge):

$T = \mathcal{F}(L)$

The four levels are ordered as in Figure 1. An arrow X → Y, and consequently the corresponding functional notation, indicates that the syntactic and semantic definition of level Y must take into account the content of level X.

As  no  world  is  totally  isolated,  a  body  of  background  knowledge  provides,  in 
principle,  input  to  each  level,  especially  to  the  theory,  where  general  laws  and 
domain-independent  facts  may  be  needed. 


Fig.  1.  Levels  involved  in  representing  and  reasoning  about  a  world  W.  P  denotes  a  perception 
of  objects  and  their  physical  links  in  W.  S  is  a  set  of  tables,  each  one  grouping  objects  sharing 
some  property.  L  is  a  formal  language,  whose  semantics  is  evaluated  on  the  tables  of  S.  Finally, 
T  is  a  theory  formulated  using  L,  in  which  properties  of  the  world  and  general  knowledge  are 
embedded.  General  background  knowledge  may  provide  inputs  to  any  level. 

In this paper we consider the levels as given, and we do not discuss the nature of the $\mathcal{P}$, $\mathcal{M}$, $\mathcal{D}$, and $\mathcal{F}$ processes. Instead, we concentrate on representational issues and define a Description Framework D(W), over a world W, as the 4-tuple D(W) = (P, S, L, T). Given a world W, let P be a perception of W resulting from a process $\mathcal{P}$ that uses a set of sensors, each one tailored to capture a particular signal. Each sensor has a resolution threshold that establishes the minimum difference between two signals in order to consider them as distinct. A set of values provided by the sensors in $\mathcal{P}$ is called a signal pattern or a configuration. Let $\Gamma$ be the set of possible configurations detectable by $\mathcal{P}$. In order to capture the intuitive notion that a more abstract configuration should be described in a simpler way, and that its simplicity should not depend upon the tool used for the description, but should be intrinsic to the configuration itself, we can exploit the notion of algorithmic complexity introduced by Kolmogorov [13].

Given $\mathcal{P}$, and, hence, $\Gamma$, any configuration $\gamma \in \Gamma$ can be described by a program $\pi$ on a universal computer U. Then, a complexity measure can be associated to $\Gamma$ according to the following definition:

Definition 1 - Given a perception system $\mathcal{P}$ and the associated set $\Gamma$, let $\gamma \in \Gamma$ be any configuration. Given a universal machine U, let

$$K(\gamma) = \min_{U(\pi) = \gamma} \ell(\pi) + C$$

be the Kolmogorov complexity of $\gamma$, defined as the length $\ell(\pi)$ of the shortest program $\pi$ that outputs $\gamma$ on the universal machine. The complexity of $\Gamma$ can be defined as:

$$K(\Gamma) = \max_{\gamma \in \Gamma} K(\gamma) + C$$ □

The above definition has the advantage of being an “absolute” measure of the complexity of $\Gamma$, because it is machine-independent up to an additive constant. The additive constant C can be interpreted as the length of the program that describes the common structure of the configurations, i.e., the length of the program necessary for the machine U to interpret the descriptions $\pi$. Kolmogorov complexity is not computable. Nevertheless, we can use computable approximations of K sufficient for practical purposes, provided that the same approximation is used uniformly over all the $\Gamma$s.

Definition 2 - Given a world W, let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two perception processes generating $P_1$ and $P_2$, respectively. Let $\Gamma_1$ and $\Gamma_2$ be the corresponding configuration sets. The perceived world $P_2$ is said to be simpler, according to Definition 1, than $P_1$ iff:

$K(\Gamma_2) < K(\Gamma_1)$ □
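As an illustration of a computable approximation used uniformly over all configuration sets, the sketch below takes the length of a zlib-compressed encoding as a stand-in for K. This choice of approximation is an assumption made here for illustration, not one proposed in the paper; the function names and the synthetic sensor data are likewise hypothetical.

```python
import zlib


def approx_k(configuration: bytes) -> int:
    """Computable stand-in for K: length of the zlib-compressed bytes."""
    return len(zlib.compress(configuration, 9))


def approx_k_set(configurations) -> int:
    """Complexity of a configuration set: the largest member complexity."""
    return max(approx_k(c) for c in configurations)


def is_simpler(configs_2, configs_1) -> bool:
    """Definition 2 with the approximation: K(set 2) < K(set 1)."""
    return approx_k_set(configs_2) < approx_k_set(configs_1)


# Hypothetical ground perception: 1000 sensor readings with 251 distinct values.
ground = bytes((i * 37) % 251 for i in range(1000))
# Hypothetical coarser perception: the same readings at a resolution of 64.
coarse = bytes(value // 64 for value in ground)

print(approx_k(ground), approx_k(coarse))
print(is_simpler([coarse], [ground]))
```

Lowering the sensor resolution shrinks the alphabet of readings, so the compressed description of the coarse configuration is shorter and the coarse perception counts as simpler under this proxy.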

The above definition has the advantage of linking simplicity to its semantic meaning of cognitive effort in information processing, rather than to its syntactic expression. Obviously, syntactic complexity may have an effect on the simplicity of a perceived world, when a higher syntactic complexity implies more work to handle the conveyed information.

Definition 3 - Given a world W, let Dg(W) = (Pg, Sg, Lg, Tg) and Da(W) = (Pa, Sa, La, Ta) be two description frameworks over the same world W, which we conventionally label as ground and abstract, respectively. An Abstraction is a mapping

$\mathcal{A} : D_g(W) \to D_a(W)$

such that Pa is “simpler” than Pg. □

Definition 3 is very general, and does not impose any “semantic” link between $\Gamma_g$ and $\Gamma_a$; it only states that $\Gamma_a$ is simpler to describe. Abstraction is, of course, a transitive relation [10, 23], and chains of abstraction mappings can be considered. Notice that abstraction is not concerned with the probability distribution of the configurations belonging to $\Gamma$: just one configuration, the observed one $\gamma$, is relevant. Also, the target problem is not recognizing the configuration, but describing it.

Assuming  that  a  ground  representation  framework  Dg  (W)  =  (Pg,  Sg,  Lg,  Tg)  has 
been  defined  over  a  world  W,  we  will  now  investigate  the  relations  between  Dg  and  a 
more  abstract  representation  framework  Da(W)  =  (Pa,  Sa,  La,  Ta).  These  relations 
are  schematically  illustrated  in  Figure  2. 


Fig.  2.  Abstraction  mapping  between  a  ground  representation  framework  Dg(W)  =  (Pg,  Sg, 
Lg,  Tg)  and  a  more  abstract  one  Da(W)  =  (Pa,  Sa,  La,  Ta). 

The symbols $\omega$, $\sigma$, $\lambda$, and $\tau$ denote abstraction operators belonging to four sets $\Omega_P$, $\Omega_S$, $\Omega_L$, and $\Omega_T$, respectively, acting at the four considered levels.

In  Figure  2  two  dimensions  appear:  the  vertical  one,  along  which  the  nature  of  the 
representation  changes,  and  the  horizontal  one,  along  which  abstraction,  for  each  type 
of  description,  occurs.  As  abstraction  is,  in  the  proposed  model,  grounded  on 
perception,  the  P-operators  are  the  ones  that  have  to  be  defined  first,  whereas  the  other 
operators  must  reflect,  if  possible,  the  perceptual  transformations.  If  a  perceptual 
transformation  linking  P  and  P'  exists,  no  problems  of  consistency  should  appear  [6]; 
instead,  it  is  possible  that  the  language  (and/or  the  theory)  becomes  unable  to  describe 
the new perceived world. In the next section, more details about the abstraction operators will be given.


4 Abstraction Operators

The previous section provides a formal definition of abstraction as a functional mapping $\mathcal{A}$ between a given perception of a world and a simpler one. Given $\mathcal{A}$, we would like to introduce operators that can actually carry out the transformation.


Definition 4 - An abstraction P-operator $\omega$ denotes a procedure that takes as input a perception Pg(W) of a world W and outputs a simpler perception Pa(W) of the same world. The domain of application of a P-operator represents the subset of the world perceptions to which it applies. □

Many  P-operators  can  be  defined,  depending  on  the  domain  and  task. 

Given a P-operator $\omega$, in order to obtain the structure Sa that memorizes the abstract world perception Pa, it would be enough to apply the $\mathcal{M}$ process to it, namely to define $S_a = \mathcal{M}(P_a)$. However, building the structure Sa in this way requires first explicitly building the abstract world perception $P_a = \omega(P_g(W))$ and then memorizing it. Because the memorization step is difficult to automate, and the abstract perception could be difficult to acquire, it would be more useful to have a transformation that works directly on the ground structure Sg. Therefore, we shall define the notion of a structure-level operator $\sigma$ (S-operator), compatible with a P-operator $\omega$. This compatibility provides the semantic foundation for the abstraction operators at the structure level.

Definition 5 - An S-operator $\sigma$, applicable at the structure level, is compatible with a P-operator $\omega$, at the world perception level, iff:

$\mathcal{M}(\omega(P_g(W))) = \sigma(\mathcal{M}(P_g(W)))$ □

For any given P-operator $\omega$, there is no guarantee that a corresponding, effective S-operator, compatible with $\omega$, exists, nor that, if it exists, it is unique.
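The compatibility condition of Definition 5 can be checked on a toy case. In the sketch below, a perception is reduced to a list of objects with attributes, memorization stores them as a table, the P-operator hides one attribute, and the S-operator projects the corresponding column out of the table. All of these concrete choices and names are assumptions made for illustration; the paper leaves the processes and operators abstract.

```python
def memorize(perception):
    """Memorization: store percepts as a table (column names, sorted rows)."""
    columns = sorted({attr for obj in perception for attr in obj})
    rows = sorted(tuple(obj.get(col) for col in columns) for obj in perception)
    return columns, rows


def omega_hide(perception, attribute):
    """P-operator: perceive the same objects without one attribute."""
    return [{a: v for a, v in obj.items() if a != attribute}
            for obj in perception]


def sigma_project(table, attribute):
    """S-operator: drop the corresponding column directly from the table."""
    columns, rows = table
    keep = [i for i, col in enumerate(columns) if col != attribute]
    return ([columns[i] for i in keep],
            sorted(tuple(row[i] for i in keep) for row in rows))


def compatible(perception, attribute):
    """Definition 5: memorize(omega(P)) == sigma(memorize(P))."""
    return (memorize(omega_hide(perception, attribute))
            == sigma_project(memorize(perception), attribute))


p_ground = [{"id": "a", "color": "red", "size": 2},
            {"id": "b", "color": "blue", "size": 5}]
print(sigma_project(memorize(p_ground), "color"))
print(compatible(p_ground, "color"))   # True
```

In this toy case the abstract structure is obtained from the ground table alone, without rebuilding and re-memorizing the abstract perception, which is the practical motivation given above for S-operators.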

Operators that directly modify languages have been predominant in modeling
abstraction [16, 20, 5, 14]. Given a P-operator ω and a compatible S-operator σ, in
order to obtain the abstract language La that describes the abstract structure Sa =
σ(Sg), it would be enough to apply 𝒟 to the abstract structure, i.e., to define La =
𝒟(σ(Sg)). However, the same considerations made for the S-operators hold also for
the language level: we would like to define L-operators that operate directly on Lg.

Definition 6 - Given an S-operator σ, an L-operator λ at the language level is
compatible with σ iff λ(𝒟(Sg)) = 𝒟(σ(Sg)). □

Given a ground language Lg, a theory Tg is expressed as a set of formulas
formalizing a body of knowledge. These formulas use predicates from Lg, the ones
that have an operational semantics (their interpretations are in the structure), as well
as other predicate symbols. Given a compatible L-operator λ, the abstract theory Ta
may be built up by formalizing, using the abstracted language λ(Lg), the same
knowledge that was formalized using language Lg, i.e., Ta = 𝒯(λ(Lg)).

Definition 7 - Given a language Lg and an L-operator λ, a T-operator τ at the
theory level is compatible with λ iff τ(𝒯(Lg)) = 𝒯(λ(Lg)). □

The  definition  of  operators  at  the  theory  level  is  the  most  difficult  step  in  the 
whole  process  of  abstraction,  the  one  that  requires  background  knowledge  to  be 
performed.  A  preliminary  solution  to  theory  abstraction  has  been  proposed  by 
Giordana, Saitta and Roverso [4].


5  Conclusions 

In  this  paper  we  have  presented  a  new  model  of  abstraction,  which  advocates  the 
primary  role  played  by  perception  in  the  conceptualization  of  a  domain.  The  same 
claim, put forward recently by Goldstone and Barsalou [1998], provides the cognitive
grounds for this model.

The  presented  work  outlines  only  general  properties  of  abstraction.  We  have  also 
introduced  the  notion  of  operator,  which,  at  each  considered  representation  level, 
maps a ground to an “abstract” level. Specific operators may also be defined in
connection with particular applications and/or tasks.

Given  the  complexity  of  the  problem,  we  are  well  aware  that  this  work  leaves  open 
many  fundamental  questions.  One  of  them  is  related  to  the  “compatibility”  notion  and  the 
concrete  definition  of  compatible  operators. 
Abstract. Multidimensional data is often feature-space heterogeneous so that
different features have different importance in different subareas of the whole
space. In this paper we suggest a technique that searches for a strategic splitting
of the feature space identifying the best subsets of features for each instance.
Our technique is based on the wrapper approach where a classification algorithm
is used as the evaluation function to differentiate between several feature
subsets. In order to make the feature selection local, we apply the recently
developed technique for dynamic integration of classifiers. It allows us to determine
what classifier and with what feature subset should be applied for each new
instance. In order to restrict the number of feature combinations being analyzed
we propose to use decision trees. For each test instance we consider only those
feature combinations that include features present in the path taken by the test
instance in the decision tree built on the whole feature set. We evaluate our
technique on datasets from the UCI machine learning repository. In our
experiments, we use the C4.5 algorithm as the learning algorithm for base classifiers
and for decision trees that guide the local feature selection. The experiments
show advantages of the local feature selection in comparison with the selection
of one feature subset for the whole space.

1  Introduction 

Current electronic data repositories contain enormous amounts of data, including
unknown and potentially interesting patterns and relations, which knowledge discovery
and data mining methods try to reveal [8]. One commonly used approach is supervised
machine learning, in which a set of training instances is used to train one or more
classifiers that map the space formed by different features of the instances into the
set of class values [1]. Each training instance is usually represented by a vector of
the values of the features and the class label of the instance. An induction algorithm
is used to learn a classifier, which maps the space of feature values into the set of
class values.

The multidimensional data is sometimes feature-space heterogeneous so that different
features have different importance in different subareas of the whole space. Many
methods have been proposed for the purpose of feature selection, but almost all of
them ignore the fact that some features may be relevant only in context (i.e. in some
regions of the instance space) [7].

In this paper we describe a technique that searches for a division of the feature
space identifying the best subsets of features for each instance. To make the feature
selection local, we apply the recently developed technique for dynamic integration of
classifiers to determine what classifier and with what feature subset is applied for each
new instance [12]. We run experiments with well-known datasets of the UCI machine
learning repository using ensembles of quite simple base classifiers [4].

In  Chapter  2  we  consider  the  dynamic  integration  of  classifiers.  Chapter  3  discusses 
local  feature  selection  with  the  dynamic  classifier  integration.  In  the  next  chapter  we 
consider our experiments with the local feature selection technique on different
datasets. We conclude briefly in Chapter 5 with a summary and further research topics.

2  Dynamic  Integration  of  Classifiers 

In  this  chapter,  the  dynamic  integration  of  classifiers  is  discussed,  and  a  variation  of 
stacked  generalization,  which  uses  a  metric  to  locally  estimate  the  errors  of  the  base 
classifiers,  is  considered. 

There are two main approaches to the integration. The first is the combination
approach, where the base classifiers produce their classifications and the final result
is composed from them. The simplest method of combining classifiers is voting [1].
Examples of more complex algorithms are weighted voting (WV) and stacked
generalization [12].

The second is the selection approach, where one of the classifiers is selected and the
final result is the result produced by it. One very popular but simple static selection
approach is CVM (Cross-Validation Majority) [9]. An example of a more sophisticated
dynamic selection approach predicts the correctness of the base classifiers for a new
instance [11]. We have elaborated a dynamic approach that estimates the local
accuracy of the base classifiers by analyzing the accuracy in near-by instances [12].
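The two static baselines mentioned above can be sketched as follows. This is a minimal illustration, assuming that cross-validation errors are available as 0/1 lists per classifier and that voting weights are supplied by the caller; the function names are hypothetical and the code is not the implementation used in [9] or [12].

```python
def cvm_select(cv_errors):
    """Static selection: index of the classifier with the lowest cross-validation error.

    cv_errors[j] is the list of 0/1 errors of base classifier j on held-out instances.
    """
    rates = [sum(errs) / len(errs) for errs in cv_errors]
    return min(range(len(rates)), key=rates.__getitem__)

def weighted_vote(predictions, weights):
    """Combination: return the class whose supporting classifiers carry the most weight."""
    tally = {}
    for pred, w in zip(predictions, weights):
        tally[pred] = tally.get(pred, 0.0) + w
    return max(tally, key=tally.get)
```

With CVM the same classifier is used for every new instance, which is what distinguishes it from the dynamic approaches discussed next.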

The dynamic integration approach contains two phases [12]. In the learning phase
(procedure learning_phase in Fig. 1), the training set T is partitioned into v folds. The
cross-validation technique is used to estimate the errors of the base classifiers Ej(x) on
the training set and the meta-level training set T* is formed. It contains the attributes of
the training instances x and the estimates of the errors of the base classifiers on those
instances Ej(x). Several cross-validation runs can be used in order to obtain more
accurate estimates of the base classifiers' errors. Then each estimated error will be
equal to the number of times that an instance was incorrectly predicted by the
classifier when it appeared as a test example in a cross-validation run. The learning
phase finishes with training the base classifiers Cj on the whole training set.
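The learning phase described above can be sketched in a few lines. The sketch makes several assumptions that are not in the original: the base learner `OneFeatureClassifier` is a hypothetical stand-in for a real induction algorithm, folds are formed by plain shuffling rather than stratified sampling, and a single cross-validation run is used, so each estimated error is 0 or 1.

```python
import random
from collections import Counter

class OneFeatureClassifier:
    """Hypothetical base learner: majority class observed for each value of one feature."""
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        by_value = {}
        for row, label in zip(X, y):
            by_value.setdefault(row[self.feature], Counter())[label] += 1
        self.table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.table.get(row[self.feature], self.default)

def learning_phase(X, y, classifiers, v=10, seed=0):
    """Return the meta-level set T* as a list of (x, [E_1(x), ..., E_m(x)]).

    Assumes 2 <= v <= len(X) so that every training part is non-empty.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::v] for k in range(v)]
    errors = {}
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        for clf in classifiers:                      # train on T - Ti
            clf.fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:                               # estimate errors on Ti
            errors[i] = [int(clf.predict(X[i]) != y[i]) for clf in classifiers]
    for clf in classifiers:                          # final training on the whole set T
        clf.fit(X, y)
    return [(X[i], errors[i]) for i in range(len(X))]
```

The returned list plays the role of T*: it is what the application phase consults when it estimates how each base classifier behaves near a new instance.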

In the application phase, the combining classifier (either the function
DS_application_phase or the function DV_application_phase in Fig. 1) is used to
predict the performance of each base classifier for a new instance. Two different
functions implementing the application phase were considered [12]. In the DS
(Dynamic Selection) application phase the classification error E*j is predicted for each
base classifier Cj using a nearest neighbor procedure and a classifier with the smallest
error (with the least global error in the case of ties) is selected to make the final
classification. In the DV (Dynamic Voting) application phase each base classifier Cj
receives a weight Wj that depends on the local classifier's performance and the final
classification is conducted by voting classifier predictions Cj(x) with their weights Wj.
Ti       i-th fold of the training set T
T*       meta-level training set for the combining algorithm
c(x)     classification of the instance with attributes x
C        set of base classifiers
Cj       j-th base classifier
Cj(x)    prediction produced by Cj on instance x
Ej(x)    estimation of error of Cj on instance x
E*j(x)   prediction of error of Cj on instance x
m        number of base classifiers
W        vector of weights for base classifiers
nn       number of near-by instances for error prediction
W_NNi    weight of i-th near-by instance

procedure learning_phase(T, C)
begin {fills in the meta-level training set T*}
   partition T into v folds
   loop for Ti ⊂ T, i = 1, ..., v
      loop for j from 1 to m  train(Cj, T - Ti)
      loop for x ∈ Ti
         loop for j from 1 to m
            compare Cj(x) with c(x) and derive Ej(x)
         collect (x, E1(x), ..., Em(x)) into T*
   loop for j from 1 to m  train(Cj, T)
end

function DS_application_phase(T*, C, x) returns class of x
begin
   loop for j from 1 to m
      E*j <- (1/nn) * SUM(i = 1, ..., nn) W_NNi * Ej(x_NNi)   {NN estimation}
   l <- argmin_j E*j   {number of cl-er with min. E*j}
                       {with the least global error in the case of ties}
   return Cl(x)
end

function DV_application_phase(T*, C, x) returns class of x
begin
   loop for j from 1 to m
      Wj <- 1 - (1/nn) * SUM(i = 1, ..., nn) W_NNi * Ej(x_NNi)   {NN estimation}
   return Weighted_Voting(W, C1(x), ..., Cm(x))
end

Fig. 1. The algorithm for dynamic integration of classifiers [12]
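The two application-phase functions of Fig. 1 can be sketched as follows. The sketch assumes the meta-level set is a list of (x, errors) pairs with numeric attribute vectors, uses plain Euclidean distance, and gives all neighbors equal weight (W_NNi = 1); the experiments described later use a different distance function, and the neighbor weighting scheme is not specified here, so both choices are assumptions.

```python
import math
from collections import defaultdict

def local_errors(meta, x, nn):
    """Predict E*_j(x) as the mean error of classifier j over the nn nearest neighbours."""
    neigh = sorted(meta, key=lambda entry: math.dist(entry[0], x))[:nn]
    m = len(neigh[0][1])
    return [sum(errs[j] for _, errs in neigh) / len(neigh) for j in range(m)]

def ds_application_phase(meta, predictions, x, nn=7):
    """Dynamic Selection: prediction of the classifier with the smallest local error.

    predictions[j] is C_j(x); ties are broken by the smallest global error.
    """
    e_star = local_errors(meta, x, nn)
    global_err = [sum(errs[j] for _, errs in meta) / len(meta)
                  for j in range(len(e_star))]
    best = min(range(len(e_star)), key=lambda j: (e_star[j], global_err[j]))
    return predictions[best]

def dv_application_phase(meta, predictions, x, nn=7):
    """Dynamic Voting: each prediction votes with weight W_j = 1 - E*_j(x)."""
    votes = defaultdict(float)
    for pred, e in zip(predictions, local_errors(meta, x, nn)):
        votes[pred] += 1.0 - e
    return max(votes, key=votes.get)
```

DS trusts a single locally best classifier, while DV lets every classifier contribute in proportion to its estimated local accuracy.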
3  Local  Feature  Selection

In data mining the object of processing is usually multidimensional data presented by
a number of features. Commonly, a number of irrelevant features are also present.
In this chapter the feature selection problem is discussed based on local
considerations of the relevance of each feature.

The feature space is often heterogeneous, where the features that are important for
data mining are different in different regions of the feature space [2]. Of the two main
approaches to manage this [3] we apply the approach where the data mining problem
is divided into subproblems and the solution of the whole classification task is guided
by the heterogeneity of the feature space.

First, a feature selection algorithm can be based on a heuristic measure acting as a
filter extracting features from a feature set before its use by the main algorithm.
Second, a feature selection algorithm based on the actual accuracy acts as a wrapper
around the main algorithm [6]. We propose to apply the wrapper approach for
dynamic integration of classifiers [12] using local classification accuracy [15].

In this paper we consider an advanced version of the method presented in [15]. We
previously analyzed all possible combinations of features to build the ensemble of
base classifiers. However, this can be very expensive computationally and can lead to
overfitting. The large number of feature subsets and, consequently, of integrated base
classifiers dramatically increases the number of degrees of freedom in the training
process, leading to increased variance of predictions and an increased risk of
overfitting the data. To reduce the risk, we propose to limit the number of considered
feature subsets in our local wrapper technique. The base classifiers in the ensemble
can be built using combinations of only potentially locally relevant features,
discarding features that are definitely irrelevant in that region.

Some  recursive  partitioning  techniques  or  some  heuristic  measures  can  be  used  to 
discard  features  that  are  locally  irrelevant  with  a  high  probability.  Thus,  we  propose  to 
combine  our  wrapper-based  method  with  a  filter  approach,  using  it  in  advance  for 
restricting  the  possible  feature  combinations. 

In [5] a decision tree was proposed to be used for local feature selection, where for
a test case only those features are considered to be locally relevant, which lie on the
path taken by the test case in the decision tree. We propose to use a decision tree
approach [5] as a filter for feature selection with our method.
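The decision-tree filter can be sketched as follows. The `Node` class is a hypothetical binary tree with numeric threshold tests, which simplifies C4.5 (C4.5 also produces multiway splits on nominal attributes). The sketch also assumes that a candidate feature subset is built only from features lying on the path taken by the instance.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional

@dataclass
class Node:
    """Hypothetical binary decision-tree node; a leaf has feature=None."""
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None
    label: Optional[str] = None

def path_features(tree, x):
    """Collect the features tested on the path that instance x takes from root to leaf."""
    found, node = [], tree
    while node.feature is not None:
        if node.feature not in found:
            found.append(node.feature)
        node = node.left if x[node.feature] <= node.threshold else node.right
    return found

def candidate_subsets(tree, x, size):
    """Feature subsets of the given size built only from locally relevant features."""
    return list(combinations(sorted(path_features(tree, x)), size))

# Example tree: the root tests feature 2; its right child tests feature 0.
tree = Node(feature=2, threshold=0.5,
            left=Node(label="a"),
            right=Node(feature=0, threshold=1.0,
                       left=Node(label="b"), right=Node(label="c")))
```

An instance that ends in a shallow leaf yields very few candidate subsets, which is how the filter limits the number of base classifiers considered for that instance.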


4  Experiments 

In this chapter, we present our experiments with the use of the C4.5 decision tree
algorithm to guide the local feature selection process. First we describe the
experimental setting and then present results of our experiments. We conduct the
experiments on eight datasets taken from the UCI machine learning repository [4] and
on the Dystonia dataset considered in [16]. Previously we have experimentally
evaluated the dynamic classifier integration [12] and the unguided local feature
selection [15]. Here we use a similar experimental environment, but this time the
algorithm builds C4.5 decision trees with and without pruning [14] at the end of the
training phase for local feature filtering at the beginning of the application phase. For
each test instance we consider only those feature combinations that include features
present in the path taken by the test instance in the decision tree.

For each dataset 30 test runs are made. In each run 30 percent of the instances of
the dataset are randomly sampled into the test set. The remaining 70 percent of the
instances form the training set, which is passed to the learning algorithm. This training
set is divided into 10 folds using stratified random sampling because we apply 10-fold
cross-validation to build the cross-validation history for the dynamic integration of the
base classifiers [12]. The base classifiers themselves are learned using the C4.5
decision tree algorithm with pruning [14] on feature subsets including exactly one or
exactly two features. To estimate the classification errors of the base classifiers for a
new instance we use the collected information about classification errors for the seven
nearest neighbors of the new instance [12]. Based on the comparisons between
different distance functions for dynamic integration presented in [13] we decided to
use the Heterogeneous Euclidean-Overlap Metric, which produced good test
results. The test environment was implemented within the MLC++ framework (the
machine learning library in C++) [10].
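For reference, the Heterogeneous Euclidean-Overlap Metric in its commonly cited form can be sketched as below. The representation of missing values as `None` and the argument names are assumptions of this sketch; the exact variant used in the experiments is the one compared in [13].

```python
import math

def heom(a, b, nominal, ranges):
    """Heterogeneous Euclidean-Overlap Metric between two instances.

    nominal: set of indices of nominal features
    ranges:  dict mapping each numeric feature index to (max - min) on the training set
    A missing value (None) contributes the maximal per-feature distance 1.
    """
    total = 0.0
    for i, (u, v) in enumerate(zip(a, b)):
        if u is None or v is None:
            d = 1.0
        elif i in nominal:
            d = 0.0 if u == v else 1.0                        # overlap metric
        else:
            d = abs(u - v) / ranges[i] if ranges[i] else 0.0  # range-normalised difference
        total += d * d
    return math.sqrt(total)
```

Normalising numeric differences by the feature range keeps every per-feature distance in [0, 1], so nominal and numeric features contribute on a comparable scale when the seven nearest neighbors are chosen.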

Table 1 presents accuracy values and numbers of analyzed features for our
algorithm of local feature selection when the considered feature subsets consisted of
exactly one feature, and Table 2 presents the same results when the feature subsets
consisted of two features each. The left-hand side (until the bold line) of the tables
shows the averages of the classification accuracies over the 30 runs. The first five
columns include the average of the minimum accuracies of the base classifiers (min),
the average of average accuracies of the base classifiers (aver), the average of the
maximum accuracies of the base classifiers (max), the average percentage of test
instances where all the base classifiers managed to produce the right classification
(agree), and the average percentage of test instances where at least one base classifier
managed to produce the right classification in each run (cover). The next four columns
of the left-hand side of Table 1 and Table 2 include the accuracies for the four types of
integration of the base classifiers.

These are: (1) CVM - Cross-Validation Majority, (2) WV - Weighted Voting, (3)
DS - Dynamic Selection, and (4) DV - Dynamic Voting. The three columns of the
right-hand side include the minimum (min), average (aver), and maximum (max)
number of features used to classify test instances. All the above columns are averaged
over the 30 test runs.
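The five ensemble statistics reported in the first columns (min, aver, max, agree, cover) can be computed for a single test run as sketched below. The function name and the layout of the prediction matrix are assumptions; the tables report these values averaged over the 30 runs.

```python
def ensemble_statistics(predictions, truth):
    """Compute min/aver/max accuracy, agreement and coverage for one test run.

    predictions[j][i] is the class predicted by base classifier j for test instance i.
    """
    n = len(truth)
    correct = [[p == t for p, t in zip(row, truth)] for row in predictions]
    accuracies = [sum(row) / n for row in correct]
    per_instance = list(zip(*correct))       # correctness of all classifiers per instance
    return {
        "min": min(accuracies),
        "aver": sum(accuracies) / len(accuracies),
        "max": max(accuracies),
        "agree": sum(all(col) for col in per_instance) / n,   # all classifiers right
        "cover": sum(any(col) for col in per_instance) / n,   # at least one right
    }
```

The cover value is an upper bound on the accuracy any selection scheme can reach with the given ensemble, since no selection can be right where every base classifier is wrong.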

Vertically, the main body of the tables is divided into groups of three rows,
each group corresponding to one dataset. The first row contains accuracies obtained
with unguided local feature selection, the second row contains accuracies obtained
when the feature selection is guided by the C4.5 with pruning, and the third one
contains accuracies obtained when the feature selection is guided by the C4.5 without
pruning. The last group in the tables shows accuracies averaged over all the datasets.

When the numbers of features are compared inside the three lines of the groups
corresponding to all the datasets, one can see that the use of the C4.5 algorithm with or
without pruning for local feature selection significantly reduces the number of locally
analyzed features. When the corresponding accuracies achieved are compared one can
see that this has often happened without a loss in the classification accuracy. For many
datasets, the number of locally analyzed features is several times smaller on average
than the total number of features. For example, in the Dystonia dataset there are 7
features in total, whereas the average number of locally analyzed features in the case
when one-feature subsets were analyzed (Table 1) is 1.133 for pruned C4.5 and the
same number for C4.5 without pruning (the trees generated were already too simple to
prune).

Table  1.  Accuracy  values  and  numbers  of  analyzed  features  for  our  algorithm  of  local  feature 
selection  with  feature  subsets  including  one  feature 
For many datasets and on average, the classification accuracy is even higher with
the guided feature selection. Thus, sometimes the guided feature selection helps to
improve the classification accuracy, as on the Breast, MONK-1, and MONK-3
datasets. However, some datasets are complex, containing many locally relevant
features, even more than there are levels in the decision tree. In those cases, the guided
local feature selection usually produces slightly lower accuracy than the unguided
feature selection, as on the Glass dataset.


Table  2.  Accuracy  values  and  numbers  of  analyzed  features  for  our  algorithm  of  local  feature 
selection  with  feature  subsets  each  including  two  features 
                        min    aver   max    agree  cover  CVM    WV     DS     DV
unguided                0.509  0.663  0.805  0.241  0.947  0.762  0.748  0.771  0.720
C4.5 with pruning       0.375  0.673  0.887  0.490  0.911  0.778  0.764  0.771  0.767
C4.5 without pruning    0.385  0.676  0.890  0.478  0.919  0.781  0.766  0.774  0.770

For some datasets, the dynamic integration (local feature selection) is clearly better
than the static approaches, as on the Liver and MONK-3 datasets. The MONK-3
dataset is a good example of the benefit from the local feature selection guided by a
decision tree. On this dataset, the guided feature selection is better than the unguided
one, and the dynamic integration is better than the static integration. The dynamic
integration of classifiers built on different feature subsets guided by the C4.5 decision
tree is the best choice for this domain. The C4.5 decision tree clearly helps to reject
the three irrelevant features present in the dataset. One can see that at maximum, just
three relevant features are selected as locally relevant with the C4.5 on this dataset.

According to the last groups of the tables, the guided feature selection is better on
average than the unguided one. The C4.5 without pruning naturally generates bigger
trees, which leads to a greater average number of locally relevant features selected.
However, the accuracies of the algorithm with C4.5 with and without pruning are
usually almost indistinguishable. Only on the Glass dataset does pruning give clearly
better results. When the two tables are compared, one can see that Table 2 naturally
contains greater accuracies (because subsets with 2 features were analyzed whereas
only one-feature subsets were analyzed in Table 1). However, the difference is less
than one could expect. Integration of classifiers based on only one feature gives
surprisingly good results, and these results are only slightly improved when more
features are considered.


5  Conclusion 

In this paper we described a technique that searches for a division of the feature space
identifying the best subsets of features for each instance. The technique is based on the
local wrapper approach, and uses the method for dynamic integration of classifiers to
determine what classifier and with what feature subset is applied for each new
instance. At the application phase, in order to restrict the number of feature
combinations being analyzed, we used the C4.5 decision tree built on the whole feature
set as a feature filter. For each test instance we considered only those feature
combinations that included features present in the path taken by the test instance in the
decision tree. Our technique can be applied in the case of implicit heterogeneity when
the regions of heterogeneity cannot be easily defined by a simple dependency.

We conducted experiments on datasets of the UCI machine learning repository
using ensembles of simple base classifiers each generated on either one or two features.
The results achieved are promising and show that the local feature selection in
comparison with selecting only one feature set for the whole space can be advantageous.

Further experiments can be conducted to make a deeper analysis of applying
recursive partitioning and the dynamic integration of classifiers for local feature
selection (and particularly to define the dependency between the parameters of local
feature selection, characteristics of a domain, and the data mining accuracy). Decision
trees built on the whole instance set were used in our experiments. Use of other feature
filters can be tested in future experiments. Another potentially interesting topic for
further research is the analysis of feature subsets without the restriction on their size.
Also it would be interesting to consider an application of this technique to a hard
real-world problem.
Abstract. This paper is devoted to the problem of learning to predict
ordinal (i.e., ordered discrete) classes using classification and regression
trees. We start with S-CART, a tree induction algorithm, and study various
ways of transforming it into a learner for ordinal classification tasks.
These algorithm variants are compared on a number of benchmark data
sets to verify the relative strengths and weaknesses of the strategies and
to study the trade-off between optimal categorical classification accuracy
(hit rate) and minimum distance-based error. Preliminary results indicate
that this is a promising avenue towards algorithms that combine
aspects of classification and regression.


1  Introduction 

Learning to predict discrete classes or numerical values from preclassified
examples has long been, and continues to be, a central research topic in Machine
Learning (e.g., [Breiman et al., 1984, Quinlan, 1992, Quinlan 1993]). A class
of problems between classification and regression, learning to predict ordinal
classes, i.e., discrete classes with a linear ordering, has not received much
attention so far, which seems somewhat surprising, as there are many classification
problems in the real world that fall into that category.

Recently, [Potharst & Bioch, 1999] presented a tree-based algorithm for the
prediction of ordinal classes. [Potharst & Bioch, 1999] assume that the independent
variables are ordered as well, which implies that the predictions made should
be consistent with the order of the attribute values in the decision nodes. So,
the authors present “repair strategies” correcting inconsistent trees in case these
consistency constraints are violated, as well as an algorithm for constructing
consistent trees in the first place.

Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI  1932,  pp.  426-434,  2000. 
Other machine learning research that seems relevant to the problem
of predicting ordinal classes is work on cost-sensitive learning. In the domain
of propositional learning, some induction algorithms have been proposed
that can take into account matrices of misclassification costs (e.g.,
[Schiffers, 1997, Turney, 1995]). Such cost matrices can be used to express relative
distances between classes. In the area of statistics, there are methods directly
relevant to our problem (e.g., Ordinal Logistic Regression [McCullagh, 1980]);
some of these have also been studied in the field of neural networks (e.g.,
[Mathieson, 1996]). However, our goal is to develop induction algorithms that
produce interpretable, symbolic models. Moreover, our algorithm S-CART, to
be presented below, can learn in both propositional and relational domains.

The purpose of the research described in this paper is to study ways of
learning to predict ordinal classes using regression trees. We will start with an
algorithm for the induction of regression trees and turn it into an ordinal learner
by some simple modifications. This seems a natural strategy because regression
algorithms by definition have a notion of relative distance of target values, while
classification algorithms usually do not. More precisely, we start with the
algorithm S-CART (Structural Classification and Regression Trees) [Kramer 1996,
Kramer 1999] and study several modifications of the basic algorithm that turn it
into a distance-sensitive classification learner. Several variants of this algorithm
are compared on a number of data sets to verify the relative strengths and
weaknesses of the strategies and to study the trade-off between optimal categorical
classification accuracy (hit rate) and minimum distance-based error.

2  The  Basic  Learning  Algorithm:  S-CART  (Structural 
Classification  and  Regression  Trees) 

Structural Classification and Regression Trees (S-CART) [Kramer 1996, Kramer
1999] is an algorithm that learns a first-order theory for the prediction of
either discrete classes or numerical values from examples and relational
background knowledge. The algorithm constructs a tree containing a positive literal
or a conjunction of literals in each node, and assigns a discrete class or a
numeric value to each leaf. S-CART is a full-fledged relational version of CART
[Breiman et al., 1984]. After the tree growing phase, the tree is pruned using
so-called error-complexity pruning for regression or cost-complexity pruning for
classification [Breiman et al., 1984]. These types of pruning are based on a
separate “prune set” of examples or on cross-validation.

For the construction of a tree, S-CART follows the general procedure of
top-down decision tree induction algorithms [Quinlan, 1993]. It recursively builds a
binary  tree,  selecting  a  positive  literal  or  a  conjunction  of  literals  (as  defined 
by  user-defined  schemata  [Silverstein  &  Pazzani,  1991])  in  each  node  of  the  tree 
until  a  stopping  criterion  is  fulfilled.  The  algorithm  keeps  track  of  the  examples 
in  each  node  and  the  positive  literals  or  conjunctions  of  literals  in  each  path 
leading  to  the  respective  nodes.  This  information  can  be  turned  into  a  clausal 
theory  (i.e.,  a  set  of  first-order  classification  or  regression  rules). 
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As  a  regression  algorithm,  S-CART  is  designed  to  predict  a  numeric  (real) 
value  in  each  node  and,  in  particular,  in  each  leaf.  In  the  original  version  of  the 
algorithm  the  target  value  predicted  in  a  node  (let  us  call  this  the  center  value 
from  now  on)  is  simply  the  mean  of  the  numeric  class  values  of  the  instances 
covered  by  the  node.  A  natural  choice  for  the  evaluation  measure  for  rating 
candidate  splits  during  tree  construction  is  then  the  Mean  Squared  Error  (MSE) 
of  the  example  values  relative  to  the  means  in  the  two  new  nodes  created  by  the 
split: 


$$\mathrm{MSE} = \frac{1}{n_1 + n_2} \sum_{i=1}^{2} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \qquad (1)$$

where $n_i$ is the number of instances covered by branch $i$, $y_{ij}$ is the value of the
dependent variable of training instance $e_j$ in branch $i$, and $\bar{y}_i$ is the mean of the
target values of all training instances in branch $i$.
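
The split measure of Eq. (1) can be illustrated with a few lines of Python. This is only a sketch under simple assumptions: the function name `split_mse` is hypothetical, each branch is passed as a plain list of target values, and both branches are assumed to be non-empty (as the minimum-cardinality requirement on partitions would ensure).

```python
from statistics import fmean


def split_mse(left, right):
    """Mean squared error of a binary split, following Eq. (1).

    `left` and `right` hold the target values of the training instances
    that fall into the two branches created by a candidate split.
    """
    total = 0.0
    for branch in (left, right):
        center = fmean(branch)  # mean target value of this branch
        total += sum((y - center) ** 2 for y in branch)
    return total / (len(left) + len(right))


# A split that separates low from high target values yields a lower MSE
# than one that mixes them.
print(split_mse([1, 1, 2], [4, 4]))
print(split_mse([1, 4, 2], [1, 4]))
```

During tree construction, a candidate literal with a lower value of this measure would be preferred.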

In  constructing  a  single  tree,  the  simplest  possible  stopping  criterion  is  used 
to  decide  whether  the  tree  should  be  further  refined:  S-CART  stops  extending 
the  tree  given  some  node  when  no  literal(s)  can  be  found  that  produce(s)  two 
partitions  of  the  training  instances  in  the  node  with  a  required  minimum  car¬ 
dinality.  The  post-pruning  strategy  then  takes  care  of  reducing  the  tree  to  an 
appropriate  size. 

S-CART  has  been  shown  to  be  competitive  with  other  regression  algorithms. 
Its  main  advantages  are  that  it  offers  the  full  power  and  flexibility  of  first-order 
(Horn  clause)  logic,  provides  a  rich  vocabulary  for  the  user  to  explicitly  represent 
a  suitable  language  bias  (e.g.  through  the  provision  of  schemata),  and  produces 
trees  that  are  interpretable  as  well  as  good  predictors. 

As  our  goal  is  to  predict  discrete  ordered  classes,  S-CART  cannot  be  used 
directly  for  this  task.  We  will,  however,  include  results  with  standard  S-CART 
in  the  experimental  section  to  find  out  how  paying  attention  to  ordinal  classes 
influences  the  mean  squared  error  achievable  by  a  learner. 


3  Inducing  Trees  for  the  Prediction  of  Ordinal  Classes 

In  the  following,  we  describe  a  few  simple  modifications  that  turn  S-CART  into  a 
learning  algorithm  for  ordinal  classification  problems.  In  section  3.2,  we  consider 
some  pre-processing  methods  that  also  might  improve  the  results. 


3.1  Adapting  S-CART  to  Ordinal  Class  Prediction 

The  most  straightforward  way  of  adapting  a  regression  algorithm  like  S-CART 
to  classification  tasks  is  to  simply  run  the  algorithm  on  the  given  data  as  if  the 
ordinal  classes  (represented  by  integers)  were  real  values,  and  then  to  apply  some 
sort  of  post-processing  to  the  resulting  rules  or  regression  tree  that  translates 
real-valued  predictions  into  discrete  class  labels. 

An obvious post-processing method is rounding. S-CART is run on the training
data, producing a regular regression tree. The real values predicted in the
leaves of the tree are then simply rounded to the nearest of the ordinal classes
(not to the nearest integer, as the classes may be discontiguous; after
pre-processing, they might indeed be non-integers; see section 3.2 below).
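
As an illustration, rounding to the nearest class (rather than to the nearest integer) might look as follows. The function name `round_to_class` and the tie-breaking rule are assumptions of this sketch; the text does not say how ties between two equally distant classes are resolved.

```python
def round_to_class(prediction, classes):
    """Return the ordinal class value closest to a real-valued prediction.

    Ties are broken in favour of the lower class; this tie rule is an
    assumption of the sketch, not something the text prescribes.
    """
    return min(sorted(classes), key=lambda c: abs(c - prediction))


# The classes need not be contiguous integers.
classes = [1, 2, 4]
print(round_to_class(2.4, classes))
print(round_to_class(3.2, classes))
```

Because the function only compares distances, it works unchanged for the non-integer class values produced by pre-processing.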

More  complex  methods  for  mapping  predicted  real  values  to  symbolic  (ordi¬ 
nal)  class  labels  are  conceivable.  In  fact,  we  did  experiments  with  an  algorithm 
that  greedily  searches  for  a  mapping,  within  a  defined  class  of  functions,  that 
minimizes  the  mean  squared  error  of  the  resulting  (mapped)  predictions  on  the 
training  set.  Initial  experiments  were  rather  inconclusive;  in  fact,  there  were 
indications  of  the  algorithm  overfitting  the  training  data.  However,  more  sophis¬ 
ticated  methods  might  turn  out  to  be  useful.  This  is  one  of  the  goals  of  our 
future  research. 

An  alternative  to  post-processing  is  to  modify  the  way  S-CART  computes 
the  target  values  in  the  nodes  of  the  tree  during  tree  construction.  We  can  force 
S-CART  to  always  predict  integer  values  (or  more  generally:  a  valid  class  from 
the  given  set  of  ordinal  classes)  in  any  node  of  the  tree.  The  leaf  values  will  thus 
automatically  be  valid  classes,  and  no  post-processing  is  necessary. 

It  is  a  simple  matter  to  modify  S-CART  so  that  instead  of  the  mean  of  the 
class  values  of  instances  covered  by  a  node  (which  will  in  general  not  be  a  valid 
class  value),  it  chooses  one  of  the  class  values  represented  in  the  examples  covered 
by  the  node  as  the  center  value  that  is  predicted  by  the  node,  and  relative  to 
which  the  node  evaluation  measure  (e.g.,  the  mean  squared  error,  see  Section 
2  above)  is  computed.  Note  that  in  this  way,  we  modify  S-CART’s  evaluation 
heuristic  and  thus  its  bias. 

There are many possible ways of choosing a center value; we have implemented
three: the median, the rounded mean, and the mode, i.e., the most frequent
class. Let $E_i$ be the set of training examples covered by node $N_i$ during
tree construction and $C_i$ the multiset of the class labels of the examples in $E_i$,
with $|E_i| = |C_i| = n$. In the Median strategy, S-CART selects the class $c_i$ as
center value that is the median of the class labels in $C_i$; in other words, if we
assume that the example set $E_i$ is sorted with respect to the class values of the
examples, Median chooses the class of the $(n/2)$th example.¹ In contrast, the
RoundedMeanToClass strategy chooses the class closest to the (real-valued)
mean $\bar{c}$ of the class values in $C_i$. Finally, in the Mode strategy the center value
$c_i$ for node $N_i$ is chosen to be the class with the highest frequency in $C_i$.
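
The three center-value strategies can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: for odd $n$ the middle element is taken as the median, and ties (between equally distant or equally frequent classes) are resolved in favour of the lower class.

```python
from collections import Counter
from statistics import fmean


def median_center(labels):
    """Class of the (n/2)th example after sorting (middle one for odd n)."""
    ordered = sorted(labels)
    return ordered[(len(ordered) - 1) // 2]


def rounded_mean_center(labels):
    """Class present in the node that is closest to the mean class value."""
    mean = fmean(labels)
    return min(sorted(set(labels)), key=lambda c: abs(c - mean))


def mode_center(labels):
    """Most frequent class in the node (lowest class wins a tie)."""
    counts = Counter(labels)
    return max(sorted(counts), key=counts.get)


labels = [1, 1, 2, 4, 4, 4]  # class labels of the examples covered by a node
print(median_center(labels), rounded_mean_center(labels), mode_center(labels))
```

All three functions always return a value that occurs among the labels of the node, so the predicted center is a valid class by construction.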

Table 1 summarizes the variants of S-CART that will be put to the test in
Section 4 below.

Table 1. Variants of S-CART for learning ordinal classes.

Name                  Formula
Postproc. Round       $c_i$ = mean of the $cl_j \in C_i$;
                      real values in leaves of learned tree are rounded
                      to nearest class in $C_i$
Median                $c_i$ = median of class labels in multiset $C_i$
RoundedMeanToClass    $\bar{c}$ = mean of the $cl_j \in C_i$,
                      $c_i$ = $\bar{c}$ rounded to nearest class $cl_j \in C_i$
Mode                  $c_i$ = most frequent class in $C_i$

¹ In this case the Mean Absolute Deviation (MAD) is used as distance metric instead
of the Mean Squared Error, because the former measure is the one that is known to
be minimized by the median.

3.2 Preprocessing

The results of regression algorithms can often be improved by applying various
transformations to the raw input data before learning. The basic idea underlying
different data transformations is that numbers may represent fundamentally
different types of measurements. [Mosteller & Tukey, 1977] distinguish, among
others, the broad classes of amounts and counts (which cannot be negative),
ranks (e.g., 1 = smallest, 2 = next-to-smallest, ...), and grades (ordered labels,
as in A, B, C, D, E). They suggest the following types of pre-processing
transformations: for amounts and counts, translate value $v$ to $t_v = \log(v + c)$;
for ranks, $t_v = \log((v - 1/3)/(N - v + 2/3))$, where $N$ is the maximum rank;
and for grades, $t_v = (\phi(P) - \phi(p))/(P - p)$, where $P$ is the fraction of observed
values that are at least as big as $v$, $p$ is the fraction of values $> v$, and
$\phi(x) = x \log x + (1 - x) \log(1 - x)$. We have tentatively implemented these three
pre-processing methods in our experimental system and applied the appropriate
transformation to the respective learning problem in our experiments. Table
2 summarizes them in succinct form, in the notational frame of our learning
problem.

Table 2. Pre-processing types ($c$ = original class value; $t_c$ = transformed class value)

Name      Formula
Raw       No pre-processing ($t_c = c$)
Counts    $t_c = \log(c + 1 - \min(Classes))$
Ranks     $t_c = \log((c - 1/3)/(N - c + 2/3))$, where $N = \max(Classes)$
Grades    $t_c = (\phi(P) - \phi(p))/(P - p)$, where
          $\phi(x) = x \log x + (1 - x) \log(1 - x)$,
          $P$ = fraction of observed class values $\geq c$,
          $p$ = fraction of observed class values $> c$

Note that these transformations do not by themselves contribute to the goal
of learning rules for ordinal classes. They are simply tested here as additional
enhancements to the methods described above. In fact, pre-processing usually
transforms the original ordinal classes into real numbers. That is no problem,
however, as the number of distinct values remains unchanged. Thus, the transformed
values can still be treated as discrete class values without changing the
learning algorithms.

In  the  experiments,  we  applied  only  the  one  type  of  pre-processing  that  we 
considered  suitable  for  the  dependent  variable  of  the  given  learning  problem. 
Subsequently,  we  rounded  to  the  next  class  in  the  “transformed  space”  and 
mapped  this  prediction  back.  As  it  turned  out,  the  dependent  variable  was  a 
“grade”  in  all  four  application  domains.  So,  due  to  the  nature  of  the  data,  we 
actually  use  only  one  transformation  in  the  experiments.  Although  we  conjecture 
that  the  other  transformations  might  give  good  results  as  well,  this  has  to  be 
confirmed  (or  refuted)  in  further  experiments  in  other  domains. 
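
To make the "grades" transformation and the mapping back to the original classes concrete, here is a small sketch. The function names are hypothetical, and the convention $0 \log 0 = 0$ (needed for the lowest and highest class, where $P = 1$ or $p = 0$) is an assumption of this sketch rather than something stated in the text.

```python
import math
from collections import Counter


def phi(x):
    """phi(x) = x log x + (1 - x) log(1 - x), taking 0 log 0 as 0."""
    def xlogx(v):
        return v * math.log(v) if v > 0 else 0.0
    return xlogx(x) + xlogx(1 - x)


def grades_transform(labels):
    """Map each observed class c to t_c = (phi(P) - phi(p)) / (P - p)."""
    n = len(labels)
    counts = Counter(labels)
    mapping = {}
    greater = 0  # number of observed values strictly greater than c
    for c in sorted(counts, reverse=True):
        p = greater / n                # fraction of values > c
        P = (greater + counts[c]) / n  # fraction of values >= c
        mapping[c] = (phi(P) - phi(p)) / (P - p)
        greater += counts[c]
    return mapping


def predict_class(value, mapping):
    """Round a prediction in the transformed space to the nearest
    transformed class and map it back to the original class."""
    return min(mapping, key=lambda c: abs(mapping[c] - value))


mapping = grades_transform([1, 1, 2, 3])
print(mapping)
print(predict_class(-0.4, mapping))
```

The number of distinct values is unchanged by the transformation, so each transformed value corresponds to exactly one original class.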

4  Experiments 

4.1  Algorithms  compared 

In  the  following,  we  experimentally  compare  the  S-CART  variants  and  prepro¬ 
cessing  methods  on  several  benchmark  datasets.  Three  quantities  will  be  mea¬ 
sured: 

1.  Classification  Accuracy  as  the  percentage  of  exact  class  hits, 

2. the Root Mean Squared Error (RMSE) $\sqrt{\frac{1}{n}\sum(\hat{c} - c)^2}$ of the predictions
on the test set, as a measure of the average distance of the algorithms’
predictions from the true class, and

3.  the  Spearman  rank  correlation  coefficient  with  a  correction  for  ties,  which  is 
a  measure  for  the  concordance  of  actual  and  predicted  ranks. 
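
The three quantities can be computed along the following lines. This is a sketch with hypothetical function names; in particular, the text does not spell out the tie correction used for the Spearman coefficient, so the version below (Pearson correlation of average ranks) is an assumption, chosen because it is a common tie-aware formulation.

```python
import math


def accuracy(actual, predicted):
    """Fraction of exact class hits."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between predicted and true class values."""
    return math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(actual, predicted):
    """Pearson correlation of average ranks (undefined for constant input)."""
    ra, rp = average_ranks(actual), average_ranks(predicted)
    ma, mp = sum(ra) / len(ra), sum(rp) / len(rp)
    cov = sum((x - ma) * (y - mp) for x, y in zip(ra, rp))
    va = sum((x - ma) ** 2 for x in ra)
    vp = sum((y - mp) ** 2 for y in rp)
    if va == 0 or vp == 0:
        return float("nan")
    return cov / math.sqrt(va * vp)


actual, predicted = [1, 2, 3, 4], [1, 2, 4, 4]
print(accuracy(actual, predicted), rmse(actual, predicted),
      spearman(actual, predicted))
```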

As  ordinal  class  prediction  is  somewhere  ‘between’  classification  and  regres¬ 
sion,  we  additionally  include  two  ‘extreme’  algorithms  in  the  experimental  com¬ 
parison.  One,  S-CART_CLASS,  is  a  variant  of  S-CART  designed  for  categorical 
classification.  S-CART_CLASS  chooses  the  most  frequent  class  in  a  node  as  cen¬ 
ter  value  and  uses  the  Gini  index  of  diversity  [Breiman  et  al.,  1984]  as  evaluation 
measure;  it  does  not  pay  attention  to  the  distance  between  classes.  The  other  ex¬ 
treme,  called  S-CART_REGRESS,  is  simply  the  original  S-CART  as  a  regression 
algorithm  that  acts  as  if  the  task  were  to  predict  real  values;  we  are  interested  in 
finding  out  how  much  paying  attention  to  the  discreteness  of  the  classes  costs  in 
terms  of  achievable  RMSE.  (Of  course,  the  percentage  of  exact  class  hits  achieved 
by  S-CART_REGRESS  cannot  be  expected  to  be  high.)  Finally,  we  will  also  list 
the  Default  or  Baseline  Accuracy  for  each  algorithm  on  each  data  set  and  the 
corresponding  Baseline  RMSE. 

4.2  Data  sets 

The  algorithms  were  compared  on  four  datasets  that  are  characterized  by  a  clear 
linear  ordering  among  the  classes.  Three  of  the  data  sets  were  taken  from  the 
UCI  repository:  Balance,  Cars  and  Nursery.  The  fourth  dataset,  the  Biodegrad¬ 
ability  data  set  [Dzeroski  et  al.,  1999],  describes  328  chemical  substances  in  the 
familiar “atoms and bonds” representation [Srinivasan et al., 1995]. The task is
to  predict  the  half-rate  of  surface  water  aerobic  aqueous  biodegradation  in  hours. 
For  previous  experiments,  we  had  already  discretized  this  quantity  and  mapped 
it  to  the  four  classes  fast,  moderate,  slow,  and  resistant,  represented  as  1,  2,  3, 
and  4. 

4.3  Results 

In Table 3, we summarize the results (RMSE and classification accuracy) of
10-fold stratified cross-validation runs on these data sets.

The first and most fundamental observation we make is that the learners
improve  upon  the  baseline  values  in  almost  all  cases,  both  in  terms  of  RMSE  and 
in  terms  of  classification  accuracy.  In  other  words,  they  really  learn  something. 

As  expected,  there  seems  to  be  a  fundamental  tradeoff  between  the  two 
goals  of  error  minimization  and  accuracy  maximization.  This  tradeoff  shows 
most clearly in the results of the ‘extreme’ algorithms S-CART_REGRESS and
S-CART_CLASS: S-CART_CLASS, which solely seeks to optimize the hit rate
during tree construction but has no notion of class distance, is among the best
class predictors in all four domains, but among the worst in terms of RMSE.
S-CART_REGRESS, on the other hand, is rather successful as a minimizer of the
RMSE, but unusable as a classifier.

Interestingly,  neither  of  the  two  solves  its  particular  problem  optimally:  some 
ordinal  learners  beat  S-CART_CLASS  in  terms  of  accuracy,  and  some  beat  the 
regression “specialist” S-CART_REGRESS in terms of the RMSE.

For  Balance  and  for  Cars,  both  the  pre-processing  and  the  simple  post¬ 
processing  method  are  able  to  achieve  good  predictive  accuracy  while  at  the 
same  time  keeping  an  eye  on  the  class-distance-weighted  error.  These  methods 
also  perform  favorably  in  terms  of  the  Spearman  rank  correlation  coefficient. 

For  Nursery,  the  results  are  less  pronounced.  Still,  both  methods  improve 
upon  the  classification  result  of  S-CART_REGRESS.  Post-processing  leads  to  a 
slight  improvement  in  terms  of  the  RMSE  and  the  Spearman  rank  correlation 
coefficient  (note  that  these  are  results  for  12961  examples). 

For  Balance,  Cars  and  Nursery,  methods  modifying  the  center  value  during 
tree  construction  (Median,  RoundedMeanToClass  and  Mode)  do  not  seem 
to  perform  as  well.  In  particular  for  Nursery,  these  methods  perform  drastically 
worse  than  Preproc.  Grades  and  Postproc.  Round. 

Results  for  Biodegradability  appear  to  be  different  from  the  other  results. 
The  biodegradability  domain  is  different  from  the  other  domains  in  several  re¬ 
spects:  It  has  fewer  examples,  it  is  known  to  have  class  noise  and  it  is  essentially 
relational.  Here,  methods  modifying  the  center  value  during  tree  construction 
perform  better,  but  not  good  enough  to  be  competitive  with  either  the  classifi¬ 
cation  method  or  the  regression  method.  Still,  it  should  be  noted  that  the  RMSE 
of  these  methods  is  between  the  RMSE  of  the  classification  “specialist”  and  the 
one  of  the  regression  “specialist” . 

Drawing  more  general  conclusions  from  these  limited  experimental  data 
seems  unwarranted.  Our  results  so  far  show  that  tree  learning  algorithms  for 
predicting ordinal classes can be naturally derived from regression tree algorithms,
but more extensive experiments with larger data sets from diverse areas
will  be  needed  to  establish  the  precise  capabilities  and  relative  advantages  of 
these  algorithms. 


Table 3. Results from 10-fold cross-validation for Balance (625 examples, 4 attributes),
Cars (1278 examples, 6 attributes), Nursery (12961 examples, 8 attributes)
and Biodegradability (328 examples)

                      Balance                       Cars
Approach              Accuracy  RMSE  Spearman      Accuracy  RMSE  Spearman
Baseline              46.1%     ?     -             ?         ?     -
S-CART_class          77.8%     0.80  0.676         94.6%     ?     0.933
S-CART_regress        6.4%      0.69  0.677         78.7%     0.23  0.948
Preproc. Grades       ?         0.70  0.732         94.7%     0.24  0.942
Postproc. Round       ?         0.70  0.733         94.7%     0.25  0.946
Median                76.5%     ?     0.692         91.5%     0.34  0.866
RoundedMeanToClass    75.0%     ?     0.747         93.0%     0.29  0.913
Mode                  78.9%     ?     0.710         88.9%     0.40  0.818

                      Nursery                       Biodegradability
Approach              Accuracy  RMSE  Spearman      Accuracy  RMSE  Spearman
Baseline              33.3%     ?     -             ?         ?     -
S-CART_class          98.4%     0.14  0.988         57.9%     0.84  0.561
S-CART_regress        88.0%     0.11  0.984         1.2%      0.76  0.537
Preproc. Grades       97.7%     0.14  0.986         43.3%     0.84  0.436
Postproc. Round       98.2%     0.13  0.990         48.8%     0.82  0.489
Median                92.8%     ?     0.963         50.3%     0.79  0.538
RoundedMeanToClass    93.1%     ?     0.964         47.3%     0.80  0.510
Mode                  92.3%     ?     ?             50.3%     0.82  0.506

5  Further  Work  and  Conclusion 


Further  work  will  be  to  perform  other  experiments  including  the  other  transfor¬ 
mations  suggested  by  Mosteller  and  Tukey.  It  also  would  be  interesting  to  build 
tree  induction  algorithms  that  do  not  enforce  the  prediction  of  “legal”  classes 
during  tree  construction,  but  deal  with  this  problem  in  the  pruning  phase.  More¬ 
over,  it  might  be  rewarding  to  experiment  with  tree  learners  that  optimize  some 
other  measure  (such  as  the  Spearman  rank  correlation  coefficient)  for  the  pre¬ 
diction  of  ordinal  classes. 

In  this  paper,  we  have  taken  first  steps  towards  effective  methods  for  learn¬ 
ing  to  predict  ordinal  classes  using  regression  trees.  We  have  shown  how  al¬ 
gorithms  for  learning  ordered  discrete  classes  can  be  derived  by  simple  modi¬ 
fications  to  a  basic  regression  tree  algorithm.  Preliminary  experiments  in  four 
benchmark  domains  have  shown  that,  in  some  cases,  the  resulting  algorithms 
are  able  to  achieve  good  predictive  accuracy  while  at  the  same  time  keeping  the 
class-distance-weighted error low.




A typical information retrieval (IR) system $S$ can be defined as a 5-tuple,
$S = (T, D, Q, V, f)$, where: $T$ is a set of index terms, $D$ is a set of documents, $Q$ is
a set of queries, $V$ is a subset of real numbers, and $f : D \times Q \to V$ is a retrieval
function between a query and a document. IR systems based on the vector
processing model represent documents by vectors of term values of the form
$d = (t_1, w_{d_1}; t_2, w_{d_2}; \ldots; t_n, w_{d_n})$, where $t_i$ is an index term in $d$ (i.e. $t_i \in T \cap d$)
and $w_{d_i}$ is the weight of $t_i$ that reflects the relative importance of $t_i$ in $d$. Similarly,
a query $q \in Q$ is represented as $q = (q_1, w_{q_1}; q_2, w_{q_2}; \ldots; q_m, w_{q_m})$, where
$q_i \in T$ is an index term in $q$ (i.e. $q_i \in T \cap q$) and $w_{q_i}$ is the weight of query term
$q_i$ that reflects the relative importance of $q_i$ in $q$.

Our objective is to formulate an optimal query, $Q_{opt}$, that discriminates more
preferred documents from less preferred documents. With this objective in mind,
we define a preference relation $\succeq$ on a set of documents, $A$, in a retrieval (ranking)
output as follows. For $d, d' \in A$, $d \succeq d'$ is interpreted as $d$ is preferred to
or is equally good as $d'$. It is assumed that the user’s preference relation on $A$
yields a weak order where the following conditions hold [8]:
$d \succeq d'$ or $d' \succeq d$.
$d \succeq d'$ and $d' \succeq d'' \Rightarrow d \succeq d''$.

The essential motivation is that $Q_{opt}$ provides an acceptable retrieval output;
that is, for all $d, d' \in A$, there exists $Q_{opt} \in Q$ such that
$d \succ d' \Rightarrow f(Q_{opt}, d) > f(Q_{opt}, d')$.

Given  the  user  ranking  as  a  preference  relation  defined  on  a  set  of  documents, 
a  system  that  produces  a  system  ranking  closer  to  the  user  ranking  is  better  than 
a  system  that  produces  a  ranking  that  is  further  away.  To  quantify  this  idea, 
a  performance  measure  may  be  derived  by  using  the  distance  between  a  user 
ranking and a system ranking. A possible evaluation measure is $R_{norm}$ as suggested
by Bollmann and Wong [2]; other measures have also been proposed [9].
Let $(D, \succeq)$ be a document space, where $D$ is a finite set of documents and $\succeq$ is
a preference relation as defined above. Let $A$ be some ranking of $D$ given by a
retrieval system. Then $R_{norm}$ is defined as

$$R_{norm}(A) = \frac{1}{2}\left(1 + \frac{S^+ - S^-}{S^+_{max}}\right) \qquad (1)$$

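where $S^+$ is the number of document pairs where a preferred document is ranked
higher than a non-preferred document, $S^-$ is the number of document pairs where
a non-preferred document is ranked higher than a preferred one, and $S^+_{max}$ is the
maximum possible value of $S^+$.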
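
A small sketch may help to see how $R_{norm}$ behaves. It rests on several assumptions that are not taken from the text: the user preference is encoded as a numeric grade per document (higher means more preferred), the system ranking is encoded as a score per document, $S^-$ counts the strictly preferred pairs that the system ranks in the wrong order, and $S^+_{max}$ is the number of strictly preferred pairs. The function name `r_norm` is hypothetical.

```python
from itertools import combinations


def r_norm(user_grades, system_scores):
    """R_norm = 1/2 * (1 + (S+ - S-) / S+max) for one ranked output."""
    s_plus = s_minus = s_max = 0
    for i, j in combinations(range(len(user_grades)), 2):
        if user_grades[i] == user_grades[j]:
            continue  # no strict user preference between these documents
        s_max += 1
        hi, lo = (i, j) if user_grades[i] > user_grades[j] else (j, i)
        if system_scores[hi] > system_scores[lo]:
            s_plus += 1   # preferred document is ranked higher
        elif system_scores[hi] < system_scores[lo]:
            s_minus += 1  # preferred document is ranked lower
    if s_max == 0:
        return 1.0  # assumption: no preferences means nothing to violate
    return 0.5 * (1 + (s_plus - s_minus) / s_max)


print(r_norm([2, 1, 0], [0.9, 0.5, 0.1]))
print(r_norm([2, 1, 0], [0.1, 0.5, 0.9]))
```

Under these assumptions the measure ranges from 0 (every preferred pair is inverted) to 1 (the system ranking agrees with every user preference).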
$f(Q, d) > f(Q, d') \Rightarrow Q^T d > Q^T d' \Rightarrow Q^T(d - d') > 0 \Rightarrow f(Q, b) > 0$. The steps
of the algorithm are as follows.

1. Choose a starting query vector $Q_0$; let $k = 0$.

2. Let $Q_k$ be a query vector at the start of the $(k+1)$th iteration; identify the
following set of difference vectors:

$$\Gamma(Q_k) = \{b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0\};$$

if $\Gamma(Q_k) = \emptyset$, $Q_{opt} = Q_k$ is a solution and exit; otherwise,

3. Let

$$Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$$

4. $k = k + 1$; go back to Step (2).

Theoretically it is known that SDA terminates only if the set of retrieved
documents is linearly separable. Therefore, a practical implementation of the
algorithm should guarantee that the algorithm terminates whether or not the
retrieved set is linearly separable. In this paper, we use a pre-defined iteration
number and the $R_{norm}$ measure for this purpose. The algorithm is terminated if the
iteration number reaches the pre-defined limit or the $R_{norm}$ value of the current
query is higher than or equal to some pre-defined value. Within the algorithm
loop we keep track of the query $Q_k$ that has yielded the highest $R_{norm}$ value so far,
so that it can be returned as the optimal query.
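
The update loop of the algorithm can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: documents are plain lists of numbers, $f(Q, d)$ is taken to be the inner product $Q^T d$, user preferences are given as index pairs, and only the iteration limit is used as a safeguard (the $R_{norm}$-based stopping rule and the tracking of the best query are omitted). The names `sda` and `max_iter` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sda(docs, preferences, q0, max_iter=100):
    """Iteratively adjust a query vector until every preferred document
    scores strictly higher than the document it is preferred to.

    docs        -- list of document vectors
    preferences -- list of (i, j) pairs meaning docs[i] is preferred to docs[j]
    Returns (query, converged).
    """
    diffs = [[a - b for a, b in zip(docs[i], docs[j])] for i, j in preferences]
    q = list(q0)
    for _ in range(max_iter):
        # Difference vectors that the current query still gets wrong.
        violated = [b for b in diffs if dot(q, b) <= 0]
        if not violated:
            return q, True
        # Add the sum of all violated difference vectors to the query.
        for b in violated:
            q = [qi + bi for qi, bi in zip(q, b)]
    return q, False


docs = [[1, 0], [0, 1], [0, 0]]
query, converged = sda(docs, [(0, 1), (1, 2)], [0, 0])
print(query, converged)
```

With contradictory preferences, for example `[(0, 1), (1, 0)]`, no query can satisfy all pairs, and the iteration limit is what makes the loop stop.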

2  Effectiveness  Measures 

In order to measure the performance of a classifier,¹ we use text categorization
effectiveness measures. There are a number of effectiveness measures employed
in evaluating text categorization algorithms. Most of these measures are based
on the contingency table model. Consider a system that is required to categorize
n documents by a query; the result is an outcome of n binary (or two-level)
decisions. From these decisions a dichotomous table is constructed as in Figure 1(a).
Each entry in the table specifies the number of documents with the specified
outcome label. For example, a is the number of documents whose predicted and
actual labels agree upon being relevant.

¹ In this paper, a classifier is defined as a query. Therefore, we will use query and
classifier interchangeably.

(a)
                     Actual label
Predicted label      relevant    non-relevant
Relevant             a           b
Non-relevant         c           d

(b)
                           Training    Test
With at least one topic    7,775       3,019
With no topic              1,828       280
Total                      9,603       3,299

Fig. 1. (a) Measures of system effectiveness, (b) Number of documents in the collection.

In our experiment, the performance measures are based on precision and
recall, whose values are computed as a/(a + b) and a/(a + c) in Figure 1(a),
respectively. Usually a single composite recall-precision graph is reported reflecting the
average performance of all individual queries in the system. Two average
effectiveness measures, widely used in the literature, are Macro-average and
Micro-average [4]. In information retrieval, Macro-average is preferred in evaluating
query-driven retrieval, while in text categorization Micro-average is preferred.
Consider a system with n documents and q queries. Then there are q dichotomous
tables, each of which is similar to the one in Figure 1(a), representing the
outcomes of two-level decisions (relevant or nonrelevant) by the filtering system
(predicted label) and the user/expert (actual label) when a query is evaluated
against all n documents. Macro-average computes precision and recall separately
from the dichotomous tables for each query, and then computes the mean
of these values. Micro-average, on the other hand, adds up the q dichotomous
tables all together, and then precision and recall are computed.
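
The difference between the two averages is easy to see in code. In this sketch each dichotomous table is a tuple `(a, b, c, d)` with the meaning of Figure 1(a); the function names, and the convention that a ratio with a zero denominator counts as 0, are assumptions made for illustration.

```python
def precision(a, b):
    return a / (a + b) if a + b else 0.0


def recall(a, c):
    return a / (a + c) if a + c else 0.0


def macro_average(tables):
    """Compute precision and recall per query, then average the values."""
    p = sum(precision(a, b) for a, b, c, d in tables) / len(tables)
    r = sum(recall(a, c) for a, b, c, d in tables) / len(tables)
    return p, r


def micro_average(tables):
    """Add up all tables cell by cell, then compute precision and recall."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    return precision(a, b), recall(a, c)


# One frequent topic and one rare topic (illustrative numbers only).
tables = [(8, 2, 2, 88), (1, 3, 1, 95)]
print(macro_average(tables))
print(micro_average(tables))
```

Macro-averaging gives every query the same weight, whereas micro-averaging is dominated by queries with many relevant documents.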

For the purpose of plotting a single summary figure for recall versus precision
values, an adjustable parameter is used to control assignment of documents to
profiles (or categories in text categorization). Furthermore, recall and precision
values at different parameter settings are computed to show the trade-off between
recall and precision. This single summary figure is then used to compute what
is called the breakeven point, which is the point at which recall is approximately
equal to precision [4]. It is possible to use linear interpolation to compute the
breakeven point between recall and precision points.
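
One way to carry out this interpolation is sketched below. The input format (a list of `(recall, precision)` pairs ordered by the adjustable parameter) and the function name `breakeven` are assumptions of the sketch; it looks for the first pair of neighbouring points between which precision minus recall changes sign and interpolates linearly between them.

```python
def breakeven(points):
    """Return the value at which recall equals precision, or None if the
    recall and precision curves never meet.

    points -- list of (recall, precision) pairs, ordered by threshold
    """
    for (r1, p1), (r2, p2) in zip(points, points[1:]):
        g1, g2 = p1 - r1, p2 - r2  # gap between precision and recall
        if g1 == 0:
            return r1
        if g1 * g2 < 0:
            t = g1 / (g1 - g2)  # fraction of the segment where the gap is 0
            return r1 + t * (r2 - r1)
    if points and points[-1][0] == points[-1][1]:
        return points[-1][0]
    return None


points = [(0.5, 0.9), (0.7, 0.8), (0.9, 0.6)]
print(breakeven(points))
```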


3  The  Experiment 

In this section, we describe the experimental setup in detail. First, we describe
how the Reuters-21578 dataset is parsed and the vocabulary for indexing is constructed.
After discussing our approach to training, we present the experimental results.

3.1  Reuters-21578  Data  Set  and  Text  Representation 

To experimentally evaluate the proposed information filtering method, we have
used the corpus of Distribution 1.0 of the Reuters-21578 text categorization test
collection.² This collection consists of 21,578 documents selected from Reuters
newswire stories. The documents of this collection are divided into training and
test sets. Each document has five category tags, namely, EXCHANGES, ORGS,
PEOPLE, PLACES, and TOPICS. Each category consists of a number of topics
that are used for document assignment. We restrict our study to only the TOPICS
category. To be more specific, we have used the Modified Apte split of the Reuters-21578
corpus that has 9,603 training documents, 3,299 test documents, and 8,676
unused documents.

² Reuters-21578 collection is available at: http://www.research.att.com/~lewis.

The training set was reduced to 7,775 documents as a result of screening out
training documents with an empty value of the TOPICS category. There are 135 topics
in the TOPICS category, with 118 of these topics occurring at least once in the
training and test documents.³ Three of these 118 topics are each
assigned to only one document in the test set. We have chosen to experiment
with all of these 118 topics despite the fact that three topic categories with no
occurrence in the training set automatically degrade system performance. Figure 1(b)
shows some statistics about the number of documents in the collection.

We have produced a dictionary of single words, excluding numbers, by
pre-processing the corpus, that is, by parsing and tokenizing the text
portion of the title as well as the body of both training and unused documents.
We have used a universal list of 343 stop words to eliminate function words
from the dictionary.⁴ The Porter stemmer algorithm was employed to reduce
each remaining word to its stem.⁵ Since any word occurring only a
few times is statistically unreliable [6], the words occurring fewer than five times
were eliminated. The remaining words were then sorted in descending order of
frequency.

Our filtering framework is based on the Vector Space Model (VSM), in which
documents and queries are represented as vectors of weighted terms. Let $t_{jk}$
be the $j$th term of the document with identity $k$ in a collection. One common
function employed for computing the document term weight, say $w_{jk}$, is to multiply
the term frequency (indicating the frequency of the term in a document)
by the inverse document frequency of that term, which can be formulated as
$w_{jk} = t_{jk} \times \log(N/n_j)$ [6], where $t_{jk}$ is the term frequency, $N$ is the total number
of documents in the collection, and $n_j$ is the number of documents containing
$t_{jk}$. We use a normalized version of this function (i.e., making the magnitudes of
document vectors one). A document is assigned to a topic by a particular classifier
if the cosine similarity measure between this document and the query is greater
than or equal to an externally supplied threshold value, referred to earlier as the
adjustable parameter.
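
The weighting and assignment rule can be sketched as follows. The sketch makes some assumptions for brevity: documents are lists of already stemmed tokens, vectors are sparse dictionaries, the natural logarithm is used, and the names `tfidf_vectors`, `cosine` and `assign` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return unit-length tf-idf vectors (term -> weight) for token lists."""
    n_docs = len(docs)
    # n_j: number of documents that contain term j
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(w * v.get(t, 0.0) for t, w in u.items()) / (nu * nv)


def assign(doc_vec, query_vec, threshold):
    """Assign the document to the topic if it is similar enough to the query."""
    return cosine(doc_vec, query_vec) >= threshold


docs = [["oil", "price", "oil"], ["wheat", "price"], ["wheat", "corn"]]
vectors = tfidf_vectors(docs)
print(assign(vectors[0], {"oil": 1.0}, 0.5))
print(assign(vectors[2], {"oil": 1.0}, 0.5))
```

Raising the threshold makes the classifier more selective, which is how the trade-off between recall and precision is explored.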


³ In the description of the Reuters-21578 read-me file it was stated that the number of
topics with one or more occurrences in the TOPICS category is 120, but we have found
only 118. The missing two topics were assigned to unused documents.

⁴ The stop list is available at: http://www.iiasa.ac.at/docs/R_Library/libsrchs.html.

⁵ The source code for the Porter Algorithm is found at:
http://ils.unc.edu/keyes/java/porter/index.html.

Table 1. Top 16 topics with more than 100 positive examples.

Name        Train   Test     Name            Train   Test
earn        2877    1087     ship            197     89
acq         1650    719      corn            182     56
money-fx    538     179      money-supply    140     34
grain       433     149      dlr             131     44
crude       389     189      sugar           126     36
trade       369     118      oilseed         124     47
interest    347     131      coffee          111     28
wheat       212     71       gnp             101     35

3.2 Training

In contrast to information retrieval systems, in text categorization systems, we
have neither a retrieval output nor a user query. Instead, we have a number of
topics and for each topic the document collection is partitioned into training
and test cases. The training set contains only positive examples of a topic. In
this sense, the training set is not a counterpart of the retrieval output due to
the fact that we do not have any negative examples. We can, however, construct
a training set for a topic that consists of positive and negative examples, under
the plausible assumption that any document considered as positive example for
the other topics and not in the set of positive examples of the topic at hand is
a candidate for being a negative example of this topic.
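
The construction of positive examples and negative candidates can be illustrated as follows. The data layout (a dictionary from document identifiers to sets of topic names) and the function name `training_candidates` are assumptions of this sketch; how many of the negative candidates are actually sampled is a separate decision that the sketch does not make.

```python
def training_candidates(doc_topics, topic):
    """Split labelled documents into positive examples of `topic` and
    candidate negative examples (positive for some other topic only).

    doc_topics -- dict mapping a document id to the set of its topics
    """
    positives = [d for d, topics in doc_topics.items() if topic in topics]
    negatives = [
        d for d, topics in doc_topics.items()
        if topics and topic not in topics
    ]
    return positives, negatives


# Tiny illustrative collection; the document ids are made up.
doc_topics = {
    "d1": {"earn"},
    "d2": {"grain", "wheat"},
    "d3": {"acq"},
    "d4": set(),  # a document without any topic is never a candidate
}
print(training_candidates(doc_topics, "wheat"))
```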

The  maximum  number  of  positive  examples  per  topic  in  the  corpus  is  2877 
and the average is 84. The size and especially the quality of the training set are
important issues in generating an induction rule set. In an information routing
study  [7],  the  learning  method  was  not  applied  to  the  full  training  set  but  rather 
to  the  set  of  documents  in  the  local  region  for  each  query.  The  local  region  for  a 
query  was  defined  as  the  2000  documents  nearest  to  the  query,  where  similarity 
was  measured  using  the  inner  product  score  to  the  query  expansion  of  the  initial 
query.  Also,  in  [1]  the  rules  for  text  categorization  were  obtained  by  creating  local 
dictionaries  for  each  classification  topic.  Only  single  words  found  in  documents 
on  the  given  topic  were  entered  in  the  local  dictionary. 

In our experiment, the training set for each topic consists of all positive
examples, while the negative data is sampled from other topics. The reason for
including the entire set of positive examples is that SDA is an enhanced version
of a relevance feedback algorithm, and thus a larger number of positive examples
makes the algorithm produce more efficient induction rules. Additionally,
Dumais et al. [3] report, for the Reuters-21578 data, the micro-averaged score
of the SVM (Support Vector Machine) over multiple random samples of training
sets for the top 10 categories, with varying sample size but with the size of
the negative data kept the same: the performance of the SVM degraded from 92%
to 72.6% as the sample size was reduced from the whole training set down to 1%.
Another important finding reported in that study is that the performance of
the SVM becomes somewhat unstable when a category has fewer than 5 positive
training examples. Here we have investigated the size of the training set from
a different perspective and tried to estimate the best size for the negative
data in proportion to the positive data, as described in the remaining part of
this section.
Fig. 2. (a) Performance of the top 16 topics at various negative-to-positive percentages,
(b) Rnorm value versus breakeven on the top 10 topics.


To estimate the best size for negative data in the training set, we have
experimented with the top 16 topics of the Reuters-21578 data shown in Table 1,
in which the training and testing sizes of positive data are given. We have trained
each of these topics as follows: 1) consider all positive data, 2) take φ% of
the positive data as the sample size of the negative data, 3) apply SDA on this
training set and compute the breakeven point, 4) repeat the above steps,
selecting a different negative sample in Step 2. This process is continued until all
negative data (i.e., varying sample sizes of negative data for each of the top 16
topics) is exhausted. Figure 2(a) shows the performance for φ = 10, 20, ..., 110.
The initial query of SDA is first set to the mean of the positive examples, and
for subsequent iterations the initial query is simply set to the query obtained in
the previous iteration. We set the target Rnorm value to its maximum (i.e., 1.0) and
impose a maximum of 150 iterations in case that value of Rnorm is not reached.
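
Steps 1 and 2 of the procedure above can be sketched as follows. The helper names, the use of a seeded random generator, and the rounding down of the sample size are assumptions made for illustration. Rounding down agrees with the 'grain' figures quoted in this section (433 positives, 216 negatives at 50%), but the text does not state the rounding rule.

```python
import random
from typing import Iterable, List, Tuple


def negative_sample_size(n_positive: int, phi: int) -> int:
    """Size of the negative sample as phi percent of the positive set (rounded down)."""
    return (n_positive * phi) // 100


def build_training_set(
    positive_ids: Iterable[str],
    negative_pool: Iterable[str],
    phi: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Keep all positives and draw phi percent as many negatives from the pool."""
    positives = list(positive_ids)
    pool = sorted(negative_pool)  # sorted so that a given seed is reproducible
    k = min(negative_sample_size(len(positives), phi), len(pool))
    negatives = random.Random(seed).sample(pool, k)
    return positives, negatives
```

Changing the seed yields a different negative sample of the same size, which corresponds to Step 4.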

As indicated in Figure 2(a), the quality of the induction is enhanced most
efficiently when the proportion of negative data to positive data is
10% (a local maximum), with respect to the ratio of the increase in breakeven point
to the increase in size of negative data. The maximum itself
(i.e., the best performance in the absolute sense) is reached when the proportion
of negative data is 50% or 80%. It is worth noting that performance degrades
abruptly when the size of negative data exceeds that of positive data. To
obtain the best quality of induction, we fixed the size of the negative
sample at 50% of the positive set. For example, for the 'grain' topic in Table 1,
we considered all 433 training examples as positive data and 216 as negative
data sampled from other topics.

The choice of the Rnorm value used in the termination criterion of the SDA algorithm
is important in the learning process, because there is an application-dependent
trade-off between the quality of a query and processing overhead.
Figure 2(b) shows the performance trade-offs for various values of Rnorm on the
top 10 topics. As the value of Rnorm decreases, the system performance also
decreases until a point where the query produced for subsequent values of Rnorm
Fig. 3. Average Precision/Recall on all 118 topics with breakeven point of 81.28%.


remains unchanged. This is because on the first run, the algorithm yields a query
with a higher Rnorm value than whatever the supplied Rnorm value is.

3.3  Results 

The average precision/recall graph of the 118 topics is shown in Figure 3. The
graph is magnified and plotted around the area where recall and precision are
approximately equal (i.e., the breakeven point). From the graph, the breakeven point is
approximately 81.28%. As a comparative study, Table 2 presents the results of SDA
and five other inductive algorithms that were recently tested on the
Reuters-21578 dataset [3]. The Findsim method is a variant of Rocchio's method for relevance
feedback. The weight of each term is the average (or centroid) of its weight in
positive instances of the topic. We may compare SDA with this method since
it is a first-order approximation. The names NBayes, BayesNets,
Trees, and SVM in Table 2 stand for the Naive Bayes, Bayes Nets, Decision Trees,
and Linear Support Vector Machines methods, respectively. For further details
of these methods the reader is referred to [3].
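
For readers unfamiliar with the breakeven measure, the sketch below shows one common way to obtain a point where precision equals recall for a single ranked list: cut the ranking at as many documents as there are relevant ones, so that both measures share the same numerator and denominator. This is an illustrative assumption; the text does not state how the breakeven point was interpolated or averaged across topics.

```python
from typing import Sequence


def breakeven_point(ranked_relevance: Sequence[int]) -> float:
    """Precision (= recall) at a cutoff equal to the number of relevant documents.

    ranked_relevance holds 1 for a relevant document and 0 otherwise,
    ordered from the highest-ranked document to the lowest.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = sum(ranked_relevance[:total_relevant])
    # At this cutoff precision = hits / cutoff and recall = hits / total_relevant,
    # and the cutoff equals total_relevant, so the two values coincide.
    return hits / total_relevant
```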

SDA outperforms Findsim, NBayes, and BayesNets, and is almost as good as
the Decision Trees method. It is, however, outperformed by the Linear SVM
method, because relevance feedback methods (including SDA) require as
many positive examples as possible to drift the query towards the solution
region. Therefore, for topics with a small number of positive examples, which
represent the majority of the topics in the Reuters-21578 data, the optimal query
close to the solution region is hard to find, and the performance of
SDA is the same as that of the Findsim method on these topics. Nevertheless,
on average it outperforms the Findsim method by a significant margin, which
supports the plausible claim that higher-order approximation methods, such as
SDA, outperform their first-order counterparts, such as
Findsim.
Table 2. Comparing results with other five inductive algorithms. Breakeven is
computed on top 10 topics and on overall 118 topics.

Topic          Findsim  NBayes  BayesNets  Trees   SDA      SVM
earn           92.9%    95.9%   95.8%      97.8%   98.5%    98.0%
acq            64.7%    87.8%   88.3%      89.7%   95.9%    93.6%
money-fx       46.7%    56.6%   58.8%      66.2%   78.4%    74.5%
grain          67.5%    78.8%   81.4%      85.0%   90.6%    94.6%
crude          70.1%    79.5%   79.6%      85.0%   86.5%    88.9%
trade          65.1%    63.9%   69.0%      72.5%   76.06%   75.9%
interest       63.4%    64.9%   71.3%      67.1%   77.29%   77.7%
wheat          68.9%    69.7%   82.7%      92.5%   82.2%    91.9%
ship           49.2%    85.4%   84.4%      74.2%   88.1%    85.6%
corn           48.2%    65.3%   76.4%      91.8%   82.36%   90.3%
Avg. Top 10    64.6%    81.5%   85.0%      88.4%   87.99%   92.0%
Avg. All Cat.  61.7%    75.2%   80.0%      N/A     81.28%   87.0%

Abstract. As the mounds of information and the number of Internet
users grow, the problem of indexing and retrieving of electronic information
resources becomes more critical. The existing search systems tend
to generate misses and false hits due to the fact that they attempt to
match the specified search terms without proper context in the target
information resource. In environments that contain many different types
of data, content indexing requires type-specific processing to extract
indexing information effectively. The COncordia INdexing and Discovery
(Cindi) system is a system devised to support the registration of
indexing meta-data for information resources and provide a convenient
system for search and discovery. The Semantic Header, containing the
semantic contents of information resources stored in the Cindi system,
provides a useful tool to facilitate the searching for documents based on
a number of commonly used criteria. This paper presents an automatic
tool for the extraction and storage of some of the meta-information in
a Semantic Header and the classification scheme used for generating the
subject headings.


1  Introduction 

Rapid growth in data volume, user base and data diversity renders
Internet-accessible information increasingly difficult to use effectively. The number of
information sources, both public and private, available on the Internet is
increasing almost exponentially. They include text, computer programs, books,
electronic journals, newspapers, organisational, local and national directories of
various types, sound and voice recordings, images, video clips, scientific data,
and private information services such as price lists and quotations, databases
of products and services, and speciality newsletters [3]. There is a need for an
automated search system that allows easy search for and access to relevant
resources available on the Internet, which in turn requires proper indexing of the
available information. The semantics of the resource are exploited in the
current system to extract and summarise the relevant meta-information (Semantic
Headers [2]) to support its discovery. Specialised databases maintain archives of
these Semantic Headers (SH), which could be searched by another component of
Cindi which features cooperating distributed expert systems and helps users in
locating pertinent documents.

The Cindi system provides mechanisms to register, search and manage the
SHs, with the help of easy-to-use graphical user interfaces. Cindi avoids problems
caused by differences in semantics and representation as well as incomplete and
incorrect data cataloguing by using a standardized subject heading hierarchy.
This meta-information could be entered by the primary resource provider with
the help of an Automatic Semantic Header Generator (ASHG) described in
this paper. ASHG is software that assists the authors of documents to
semi-automatically generate many of the fields of the SH and hence assists them in
the registration of their documents in the Cindi system. One of the main tasks
of ASHG is to classify a document under a list of subject headings as described
herein. As the author is required to verify and complete the ASHG-generated
Semantic Header entry, the potential for its accuracy is high.

The paper is organized as follows: in Section 2, we introduce the Cindi system.
Section 3 covers our approach to the building of the thesaurus used in the ASHG
system and Section 4 describes its components. Following this, we give the results
of our tests to generate the SH on a set of documents prepared in the HTML,
LaTeX, RTF and plain text formats, and our conclusions.

2  The  Cindi  System 

Attempts to provide easy search of relevant documents have led to a number
of systems [1,5,7,8,13,15,18,19,20]. However, the problem with many of these is
that their selectivity of documents is often poor [3]. The chances of getting
inappropriate documents and missing relevant information because of poor choice
of search terms are great. Hence, there is a need for the development of a system
which allows easy search for and access to resources available on the Internet.
Using a standard index structure and building an expert-system-based
bibliographic system using standardised control definitions and terms can alleviate
the problem and provide fast, efficient and easy access to the Web documents.
For cataloguing and searching, Cindi uses a meta-data description called SH [4]
to describe an information resource. The SH includes those elements that are
most often used in the search for an information resource. Since the majority of
searches begin with a title, name of the authors (70%), subject and sub-subject
(50%) [6], Cindi requires the entry for these elements in the SH. Similarly, the
abstract and annotations are relevant in deciding whether or not a resource is
useful, so they are included too [3]. The components of the SH are: Title,
Alt-title, Language, Character Set, Keyword, Identifier, Date, Version, Classification,
Coverage, System Requirements, Genre, Source and Reference, Cost, Abstract,
Annotations and User ID, Password.
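
As an illustration only, the components listed above could be held in a record such as the following. The class name, the field types and the choice of which fields default to empty are assumptions made for this sketch. The text gives the component names but not their representation in Cindi.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SemanticHeader:
    """Hypothetical in-memory form of a Semantic Header (SH) record."""
    title: str
    alt_title: Optional[str] = None
    language: Optional[str] = None
    character_set: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    identifier: Optional[str] = None
    date: Optional[str] = None
    version: Optional[str] = None
    # Subject headings assigned to the resource.
    classification: List[str] = field(default_factory=list)
    coverage: Optional[str] = None
    system_requirements: Optional[str] = None
    genre: Optional[str] = None
    source_and_reference: Optional[str] = None
    cost: Optional[str] = None
    abstract: Optional[str] = None
    annotations: List[str] = field(default_factory=list)
    user_id: Optional[str] = None
    password: Optional[str] = None
```

A draft header can then be created with only a title, for example SemanticHeader(title="An example resource"), and completed field by field by the resource provider.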

Preparing the primary source's SH requires identifying it as to its subject,
title, author, keywords, abstract, etc. These problems are addressed by Cindi,
which provides a mechanism to register, manage and search the bibliographic
information.

The overall Cindi system uses knowledge bases and expert sub-systems to
help the user in the registering and search processes. The index generation and
maintenance sub-system uses Cindi's thesaurus to help the provider of the
resource select the most appropriate standard terms for items such as subject,
sub-subject and keywords. Similarly, another expert sub-system is used to help
the user in the search for appropriate information resources [2].

The SH information entered by the provider of the resource using a graphical
interface is relayed from the user's workstation by a client process to the
database server process at one of the nodes of a distributed database system
(SHDDB). The node is chosen based on its proximity to the workstation or on
the subject of the index record. From the point of view of the users of the
system, the underlying database may be considered to be a monolithic system. In
reality, it would be distributed and replicated, allowing for reliable and
failure-tolerant operations. The interface hides the distributed and replicated nature of
the database. On receipt of the information, the server verifies the correctness
and authenticity of the information and, on finding everything in order, sends
an acknowledgment to the client. The server node is responsible for locating the
partitions of the SHDDB where the entry should be stored and forwards the
replicated information to appropriate nodes. The various sites of the database
work in a cooperating mode to maintain consistency of the replicated portion.
The replicated nature of the database also ensures distribution of load and
ensures continued access to the bibliography when one or more sites are temporarily
nonfunctional.

The Cindi search sub-system guides the user in entering the various search items
in  a  graphical  interface  similar  to  the  one  used  by  the  index  entry  system.  Once 
the  user  has  entered  a  search  request,  the  client  process  communicates  with 
the  nearest  SHDDB  catalogue  to  determine  the  appropriate  site  of  the  SHDDB 
database.  Subsequently,  the  client  process  communicates  with  this  database  and 
retrieves  one  or  more  SHs.  The  result  of  the  query  could  then  be  collected  and 
sent  to  the  user’s  workstation.  The  contents  of  these  headers  are  displayed,  on 
demand,  to  the  user  who  may  decide  to  access  one  or  more  of  the  actual  resources. 


3  ASHG’s  Thesaurus 

ACM [17], INSPEC [14] and Library of Congress Subject Headings (LCSH) [16]
were the main building blocks of Cindi's three-level Subject Hierarchy, which
currently is limited to the domains of Computer Science and Electrical
Engineering. ASHG's computer science subject hierarchy uses ACM's subject hierarchy
as the starting point, and the electrical engineering subject hierarchy is based on
that of INSPEC. We have exploited LCSH's subject heading relations to
refine both hierarchies. LCSH contained relations between subject headings such
as BT (Broader Term), NT (Narrow Term), UF (Used For), and RT (Related
To). In order to augment the ACM and INSPEC subject hierarchies, a search for an
ACM or INSPEC subject heading was made in LCSH. If a match was found, the
narrow terms found in LCSH under the matched subject were added to the list of
subjects or terms under the matched ACM or INSPEC subject heading. This
augmentation produced a hierarchy composed of five or six levels. Since Cindi's
subject hierarchy was limited to only three levels, the following rules, illustrated
in Figure 1, were applied to merge these subject headings. The Level_0 subject
is Computer Science or Electrical Engineering. Some of the subject headings
found in Level_1 and Level_2 of the augmented subject hierarchies were
concatenated to form Cindi's Level_1 subject heading. The same rule was applied to
subject headings at Level_3 and Level_4 to yield Cindi's Level_2 subject heading.
The subjects at the remaining deeper levels were used as controlled terms associated with
Cindi's Level_2 subject headings.

The  resulting  subject  hierarchy  has  three  levels  and  a  set  of  control  terms 
associated  with  the  lowest  level  subject  headings. 

The  reason  behind  the  Control  Term  Subject  association  is  to  extract  or 
classify  the  primary  source  under  a  number  of  subject  headings  by  comparing 
the  significant  list  of  words  contained  in  the  document  with  the  list  of  control 
terms.  An  association  between  the  control  terms  and  their  corresponding  subject 
headings  is  created. 

Each control term has three lists of subject headings attached to it, one per
level of the hierarchy. The control terms are based on the terms found in ASHG's
subject hierarchy and the additional terms that are associated with Level_2
subject headings. For each subject heading and the additional controlled terms,
we use their constituent English non-noise words as their corresponding control
terms. For example, the control term compute will be associated with the
Computer Science general subject heading.
Similarly, the control term hardware will be associated with the Hardware integrated
circuits and Hardware performance and reliability level_1 subject headings and
the Hardware Simulation Design Aids level_2 subject heading. Each controlled term
is associated with one or more subject headings.

Mapping ASHG's subject heading terms into control terms involves: removing
noise (stop) words; stemming the remaining words to find the root; and
associating the root with the corresponding subject heading.
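
The three mapping steps can be sketched as below. The stop-word list and the toy_stem function are deliberately crude stand-ins introduced for illustration. The text does not say which stop list or stemming algorithm ASHG uses, so the roots produced here (for example "comput" rather than "compute") need not match those of the real system.

```python
from typing import Dict, Iterable, Set, Tuple

# Illustrative stop list only; a real system would use a much larger one.
STOP_WORDS = {"and", "of", "the", "for", "in"}


def toy_stem(word: str) -> str:
    """Strip a few common suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def build_control_terms(
    subject_headings: Iterable[Tuple[int, str]],
) -> Dict[str, Dict[int, Set[str]]]:
    """Map each root word to three sets of headings, one per level (0, 1, 2)."""
    table: Dict[str, Dict[int, Set[str]]] = {}
    for level, heading in subject_headings:
        for word in heading.lower().split():
            if word in STOP_WORDS:
                continue  # step 1: drop noise words
            root = toy_stem(word)  # step 2: reduce the word to its root
            entry = table.setdefault(root, {0: set(), 1: set(), 2: set()})
            entry[level].add(heading)  # step 3: associate root and heading
    return table
```

With the headings used as examples in the text, the root "hardware" ends up attached to two level_1 headings and one level_2 heading.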

4  ASHG  Implementation 

In  this  section,  we  present  the  implementation  details  of  the  Automatic  Semantic 
Header  Generator  (ASHG)  of  the  Cindi  system.  This  is  an  important  step  in 
providing  the  author  of  a  document  a  draft  SH  with  an  initial  set  of  subject 
classifications  and  a  number  of  components  of  the  SH  for  the  document.  The 
ASHG  scheme  takes  into  account  both  the  occurrence  frequency  and  positional 
weight  of  keywords  found  in  the  document.  Based  on  the  document’s  keywords, 
ASHG  assigns  a  list  of  subject  headings  by  matching  those  keywords  with  the 
controlled  terms  found  in  the  controlled  term  subject  association.  The  ASHG  also 
extracts  some  of  the  meta-information  from  a  document  such  as  title,  abstract, 
keywords,  dates,  author,  author’s  information,  size  and  type. 
ASHG uses the syntax of documents in HTML, LaTeX, RTF or text to extract
the  document’s  meta-information.  ASHG  extracts  summary  information,  such 
as  the  title,  keywords,  dates  of  creation,  author,  author’s  information,  abstract 
and  size.  In  tagged  documents,  the  author  might  explicitly  tag  some  of  the  fields 
to  be  extracted.  In  case  these  fields  are  not  explicitly  tagged,  ASHG  attempts  to 
extract  them  using  heuristics.  However,  if  the  explicit  keywords  were  not  found 
in  the  document,  then  words  found  in  the  title,  abstract  and  other  tagged  words 
would  be  used  to  extract  an  implicit  list  of  keywords. 


Fig.  1.  ASHG’s  extraction  steps 


Generating  an  Implicit  List  of  Keywords  and  Words  Used  in 
Document  Classification 

ASHG generates an implicit list of keywords in case explicit keywords were not
found in the document: the system derives a list of words from the words found
in the title, abstract, and other tagged fields. This list of derived words will also
be used in classifying the document. However, if the keywords were explicitly
stated in the document, then ASHG will augment them with a list of words from
the words found in the title, abstract, and other tagged fields.

Generating both lists of words relies on the stemming process that will map
the words into their root words, the stemmed word frequency of occurrence and
the word location in the document. Because the terms are not equally useful for
content representation, it is important to introduce a term weighting system that
assigns high weights for important terms and low weight for the less important
terms [11]. The weight assignment uses the following scheme:

If the keywords are explicitly included in the document, they convey some
important concepts and hence are assigned the highest weight of five. Usually,
words found in the abstract are the second most important words, and are
assigned a weight of four. The words in the title are assigned a weight of three.
The words appearing in the other tagged fields are assigned a weight of two.

Each numeric weight is a class by itself defining the words' location. A word
that appears in several locations accumulates the weights of those locations, so
the class weight generated will range between two and 14 (5 + 4 + 3 + 2),
depending on the positions where a word appears.
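
A minimal sketch of this weighting follows. The additive combination of location weights is inferred from the stated range of two to 14 and from the later example of a weight of seven for a term occurring in both title and abstract. The names LOCATION_WEIGHTS and class_weight are illustrative.

```python
from typing import Iterable

# Location weights as given in the text.
LOCATION_WEIGHTS = {"keywords": 5, "abstract": 4, "title": 3, "other": 2}


def class_weight(locations: Iterable[str]) -> int:
    """Sum the weights of the distinct locations in which a word appears."""
    return sum(LOCATION_WEIGHTS[loc] for loc in set(locations))
```

A word found only in another tagged field gets the minimum class weight of two, and a word found in all four locations gets the maximum of 14.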

For each class, we set the maximum class frequency to be the frequency of
occurrence of the term found most often in that class. For instance, if, in class four, we
had three terms having two, four and six as frequencies, the system would select
six as the maximum class four frequency. The words' frequencies are
compared with their corresponding maximum class frequency. For low-weighted classes
such as two and three, only terms that reach the maximum class frequency are
significant, which limits the number of significant terms. However, all terms found in class
eight and above are significant regardless of their frequency of occurrence.

Table 1. Weight and Frequency numbers used in extracting terms

Term Weight   Term Frequency
2             Maximum Class 2 frequency
3             Maximum Class 3 frequency
4             Greater or equal to Maximum Class 4 frequency minus 1
5             Greater or equal to Maximum Class 5 frequency minus 1
6             Greater or equal to Maximum Class 6 frequency minus 2
7             Greater or equal to Maximum Class 7 frequency minus 3
≥ 8           All


Two lists of words are generated. The first contains only the root words
or control terms found in Cindi's thesaurus. This list of control terms is used in
the document's subject classification scheme. The second list contains the most
significant root words not found in Cindi's thesaurus. If no keywords were found
in the document, ASHG extracts the words that have a term weight of more than four and
whose frequencies of occurrence satisfy the conditions tabulated in Table 1.
These words are the document's keywords. In generating the list of control terms
used to classify the document, terms having a weight of two or more are extracted,
again subject to the frequencies of occurrence tabulated in Table 1.
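
The selection rule of Table 1 can be written out as follows. This is a sketch under assumptions: each term is represented by a (class weight, frequency) pair, the SLACK table encodes the "minus 1", "minus 2" and "minus 3" allowances, and the function name and the min_weight parameter are introduced here for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Allowed shortfall from the maximum class frequency for each class weight.
SLACK = {2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 3}


def significant_terms(
    terms: Dict[str, Tuple[int, int]], min_weight: int = 2
) -> List[str]:
    """Select terms whose frequency meets the Table 1 threshold for their class.

    terms maps a root word to (class weight, frequency of occurrence).
    """
    # Maximum frequency observed in each class.
    max_freq: Dict[int, int] = defaultdict(int)
    for weight, freq in terms.values():
        max_freq[weight] = max(max_freq[weight], freq)

    selected = []
    for term, (weight, freq) in terms.items():
        if weight < min_weight:
            continue
        # Classes of weight eight and above are always significant.
        if weight >= 8 or freq >= max_freq[weight] - SLACK.get(weight, 0):
            selected.append(term)
    return selected
```

Calling the function with min_weight=2 corresponds to the list used for classification, while min_weight=5 corresponds to the implicit keyword list (term weight of more than four).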


ASHG’s  Document  Subject  Headings  Classification  Scheme 

An important step in constructing the draft SH is to automatically assign subject
headings to the documents. The title, explicitly stated keywords, and abstract
are not enough by themselves to convey the ideas or subjects of the document.
Since the author tries to convey or to summarise his ideas in the previously
mentioned fields, there is a need to use all non-noise words found in those fields.
To assign the subject headings, ASHG uses the resulting list of significant words
generated from the previous section and the control term to subject association.
The subject heading classification scheme relies on passing weights from the
significant terms to their associated subjects, and selecting the highest weighted
subject headings. The following algorithm is used to construct the three levels
of subject headings:

1. For each term found in both Cindi's control terms and the generated list of
words, the system traces the control term's attached lists of subjects (lists of
level_0, level_1 and level_2 headings), and adds the subject headings to their
corresponding list of possible subject headings.

2. Weights are also assigned to the subject hierarchies. The weight for a subject
is given according to where the term matching its controlled term was found.
A subject heading having a term or set of terms occurring in both title and
abstract, for instance, gets a weight of seven. The matched terms' weights
are passed to their subject headings.

3. The system extracts the level_2, level_1 and level_0 subject headings having
the highest weights from the three lists of possible subject headings.

4. After building the three lists for the three-level subject headings, the system
selects the subjects using the bottom-up scheme:

a) Having selected the highest weighted level_2 subject headings, the system
derives their level_1 parent subject headings.

b) An intersection is made between the derived level_1 subject headings and
the list of the highest weighted level_1 subject headings. The common
level_1 subjects are the document's level_1 subject headings.

c) The system uses the same procedure in selecting level_0 subject headings.
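
The steps above can be sketched as a single function. Several details are assumptions made for this illustration and are not stated in the text: weights from several matching terms are added together, "highest weights" is taken to mean the headings tied for the maximum weight at each level, and the hierarchy is supplied as a parent mapping from each heading to the heading one level above it.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def classify(
    term_weights: Dict[str, int],
    control_terms: Dict[str, Dict[int, Set[str]]],
    parent: Dict[str, str],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the (level_0, level_1, level_2) subject headings for a document.

    term_weights: significant root word -> class weight.
    control_terms: root word -> {level: headings attached at that level}.
    parent: heading -> its parent heading one level up.
    """
    # Steps 1 and 2: pass each matched term's weight to its attached headings.
    scores = {0: defaultdict(int), 1: defaultdict(int), 2: defaultdict(int)}
    for term, weight in term_weights.items():
        for level, headings in control_terms.get(term, {}).items():
            for heading in headings:
                scores[level][heading] += weight

    # Step 3: keep the highest weighted headings at a given level.
    def top(level: int) -> Set[str]:
        if not scores[level]:
            return set()
        best = max(scores[level].values())
        return {h for h, w in scores[level].items() if w == best}

    # Step 4: bottom-up selection through parent headings.
    level2 = top(2)
    level1 = {parent[h] for h in level2 if h in parent} & top(1)
    level0 = {parent[h] for h in level1 if h in parent} & top(0)
    return level0, level1, level2
```

Under these assumptions, a level_1 heading is kept only if it is both a parent of a selected level_2 heading and among the highest weighted level_1 candidates, as described in step 4(b).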

Once  the  process  of  extracting  the  meta-information  is  terminated,  the  SH  is 
displayed  for  the  source  provider  to  modify,  add  or  remove  some  of  the  attributes. 
Once  the  provider  finishes,  the  semantic  header  can  be  registered  in  the  Cindi 
database. 

5  Analysis  of  ASHG’s  Results  and  Conclusions 

The experiments described here are designed to test the accuracy of the
generated index and the subject headings classification results. After applying the
ASHG on a set of documents, the generated index fields such as title, keywords,
abstract and author are compared with those that are found in the document.

The experiments were conducted on a number of documents [21]. These
documents dealt with computer science and electrical engineering subjects. Each
of these documents was rendered manually in the four formats. ASHG was able
to extract all the explicitly stated fields such as title, abstract, keywords, and
author's information with a hundred percent accuracy. If the abstract was not
explicitly stated, ASHG was able to automatically generate an abstract that
would describe the paper. However, ASHG's implicit keyword extraction
generated a list of words which included some words that were insignificant. These
insignificant words in turn led to deviations in the subject classification.

ASHG's automatic subject headings classification results are compared
with INSPEC's classification and with what the papers' authors would
regard as good subject classifications and poor ones. For the latter, we consulted
the authors about the subject headings generated by the ASHG system for their
documents. The results are tabulated in Table 2, which shows that more than 50%
of the generated subject headings were acceptable. Some of ASHG's subject
classifications used different words than INSPEC's even though they described the same subject.
That was due to the fact that our computer science subject classification was
built from ACM and not from INSPEC.

Table 2. Summary of ASHG's tests

Document   Avg. Number of Subject   Avg. Number of Acceptable   Percent of Inspec
Type       Headings Generated       Subject Headings            Heading Discovered
HTML       4.9                      66.1%                       74%
LaTeX      4.4                      63%                         80%
RTF        4.8                      60.6%                       65%
Text       5.9                      57.0%                       80%


ASHG was able to generate between 65% and 80% of the subject headings
that were generated by professional catalogers. However, since ASHG produced,
on average, more classifications, the accuracy was lower, at about 22%. Since
our system relied only on the frequency and location of words in a document
to determine the document's keywords and subject classification, it missed the
importance of word senses and of the relationships between words in a sentence.
This simplistic system did not capture the concepts behind the documents or
the ideas that the author was trying to convey. Our results support the idea that
word frequency and location are not enough in information retrieval. However,
since the ASHG's result will be used as a starting point by the author, he/she
has the opportunity to correct the errors and include fields of the SH not given
before registering it. Further work is required in refining the subject classification
to reduce the number of poor classifications.

In conclusion, we believe that resolving word senses and determining the
relationships that those words have to one another will have the greatest impact on
refining the ASHG's subject classification scheme. Therefore, we plan to pursue
semantic-level language processing in the future.


Abstract. One of the main issues in the field of information retrieval is to
bridge the terminological gap existing between the way in which users specify
their information needs and the way in which queries are expressed. One of the
approaches for this purpose, called Rule Based Information Retrieval by
Computer (RUBRIC), involves the use of production rules to capture user query
concepts (or topics). In RUBRIC, a set of related production rules is represented
as an AND/OR tree. The retrieval output is determined by Boolean evaluation
of the AND/OR tree. However, since the Boolean evaluation ignores the term-term
association unless it is explicitly represented in the tree, the terminological
gap between users' queries and their information needs can still remain. To
solve this problem, we adopt the generalized vector space model (GVSM), in
which the term-term association is well established, and extend the RUBRIC
model based on GVSM. Experiments have been performed on some variations
of the extended RUBRIC model, and the results have been compared to the
original RUBRIC model in terms of recall-precision.


1  Introduction 

Many intelligent retrieval approaches have been studied to meet users' individual
preferences properly [2, 6, 7]. However, there always exists a terminological gap
between the way queries are defined and the way documents are represented. One of
the approaches proposed in the literature to bridge this gap involves the use of
production rules to capture user query concepts (or topics). The central ideas of such
an approach were introduced in the context of a system called Rule Based Information
Retrieval by Computer (RUBRIC) [6]. In RUBRIC, a set of related production rules is
represented as an AND/OR tree, called a rule-base tree. RUBRIC allows the
definition of detailed queries starting at a conceptual level. The retrieval output is
determined by Boolean evaluation of the AND/OR tree. The RUBRIC concepts were
incorporated into a commercial system called TOPIC. Though the system was not
popular due to difficulties in generating rule bases, recent research has developed
ways to automate the creation and update (using relevance feedback) of rule bases [4,
3]. Other challenges in the successful implementation of RUBRIC concern efficiency
and the method of evaluating rule bases. The efficiency issue has been
addressed in [1, 5]. However, the method of evaluation, using Max and Min for OR
and AND, respectively, still has limitations.

To  solve  the  second  problem,  in  this  paper,  we  adopt  the  generalized  vector 
space  model  (GVSM)  [10,  11].  In  GVSM,  term-term  associations  are  computed  as  an 
integral  part  of  the  automatic  indexing  process.  We  propose  a  way  to  integrate  the 
ideas  of  concept  based  retrieval  in  RUBRIC  with  the  generalized  vector  space  model. 
Experiments  have  been  conducted  on  some  variations  of  the  integrated  model.  The 
results  show  that  the  integrated  model  is  more  effective  than  the  original  one  in  terms 
of  recall-precision. 


2  Review  of  RUBRIC 


In  RUBRIC,  concepts  of  interest  are  formulated  using  a  top-down  refinement 
strategy. In a top-down strategy, the first step is to express a given request as a single
concept.  The  next  step  is  to  refine  the  initial  concept  by  decomposing  it  into  a  set  of 
component  parts  that  are  related  through  either  the  AND  or  OR  logical  operator.  The 
individual  components  may  take  the  form  of  a  new  concept  defined  at  a  different 
abstraction  level,  a  text  expression,  or  a  single  index  term.  In  each  case,  a  weight 
value  is  assigned  to  the  individual  concept-component  pairs  that  are  formed  during 
the  decomposition  process.  The  assigned  weight  value  represents  the  user’s  belief  in 
the  degree  to  which  a  given  component  characterizes  the  related  concept. 


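[Figure 1: AND/OR rule-base tree rooted at Violent-act, with component concepts
Violent-Action-Form and Injuries. Surviving labels include the index terms
"dead", "slay", and "wound" and the edge weights 0.4, 0.7, and 0.8.]
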

Fig.  1.  Rule-base  tree  for  concept  Violent-act 

Fig. 1 shows the rule-base tree for concept Violent-act, where the leaf nodes are
index terms enclosed in double quotation marks, the internal nodes are concepts,
and the weights are displayed along the edges connecting concepts and components.
The concept Violent-act is first decomposed into two component concepts,
Violent-Action-Form and Injuries, which are related to Violent-act with the AND
operator. The AND operator is denoted by drawing a line between its branches. If
there is no line connecting the branches, the relationship is OR.

The relevancy (RSV: Retrieval Status Value) of a document to concept Violent-act
can be evaluated by a bottom-up strategy. For example, if a
document contains the words "gun," "shot," "bomb," "slay," and "dead" and no other
words in the example rule base, then the index term nodes "gun,"
"shot," "bomb," "slay," and "dead" receive a weight of 1.0 and all other index
term nodes receive a weight of 0. For example, the concept Weapon is composed
of two components, "gun" and "rifle," and the weight for "gun" is 1.0 and for "rifle" is 0.
Since the operator for Weapon is OR, the relevancy to concept Weapon is assigned
the maximum of the products of the weights of its components and the
corresponding edge weights connecting its components. In this particular case, the result is
Max(1.0*0.5, 0*0.7) = 0.5. Proceeding in a similar way, we finally get the relevancy (RSV)
of the document to the concept Violent-act, which comes out to be 0.3.
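
The bottom-up evaluation can be sketched in Python as follows. This is a minimal illustration rather than the RUBRIC implementation: the tuple encoding of the tree and the function name `rsv` are assumptions, and the tree covers only the Shooting and Injuries part of the rule base, with edge weights inferred from the weight products quoted in Section 4.2 (0.5 for "gun", 0.7 for "rifle", 1.0 for Weapon, 0.7 for "shot", 1.0 for "dead", 0.8 for "wound", and 0.6 and 0.4 on the two top-level edges).

```python
# A concept is (operator, [(edge_weight, child), ...]); a leaf is an index term.
# Structure and weights are inferred from the text, not copied from Fig. 1.
WEAPON = ("OR", [(0.5, "gun"), (0.7, "rifle")])
SHOOTING = ("AND", [(1.0, WEAPON), (0.7, "shot")])
INJURIES = ("OR", [(1.0, "dead"), (0.8, "wound")])
KILLING = ("AND", [(0.6, SHOOTING), (0.4, INJURIES)])


def rsv(node, doc_terms):
    """Evaluate a rule-base tree bottom-up: Max for OR, Min for AND."""
    if isinstance(node, str):
        # An index term scores 1.0 if the document contains it, else 0.
        return 1.0 if node in doc_terms else 0.0
    operator, children = node
    scores = [weight * rsv(child, doc_terms) for weight, child in children]
    return max(scores) if operator == "OR" else min(scores)


document = {"gun", "shot", "bomb", "slay", "dead"}
print(rsv(WEAPON, document))   # Max(1.0*0.5, 0*0.7), as in the text
print(rsv(KILLING, document))  # about 0.3 for this sub-tree
```

Because a missing term contributes 0 to every AND node above it, a document that lacks "shot" scores 0 for this sub-tree no matter how many weapon terms it contains.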

In order to make the computation more efficient, Minimal Term Sets (MTSs)
should be generated through static analysis of the rule base [1, 5]. A minimal term set
consists of the index terms that are necessary to make sure that the retrieval status value
of a concept is larger than 0. That is, if any of the index terms in the MTS is taken
out, then the RSV would be 0. We list all the MTSs and their RSVs for the
concept Violent-act in Table 1.


Table 1. MTSs and RSVs for concept Violent-act

          MTS                                RSV
MTS 1     {"gun", "shot", "dead"}            0.3
MTS 2     {"gun", "shot", "wound"}           0.3
MTS 3     {"rifle", "shot", "dead"}          0.4
MTS 4     {"rifle", "shot", "wound"}         0.32
MTS 5     {"bomb", "explosion", "dead"}      [illegible]
MTS 6     {"bomb", "explosion", "wound"}     [illegible]
MTS 7     {"shell", "explosion", "dead"}     0.36
MTS 8     {"shell", "explosion", "wound"}    0.32
MTS 9     {"murder", "dead"}                 [illegible]
MTS 10    {"murder", "wound"}                [illegible]
MTS 11    {"slay", "dead"}                   [illegible]
MTS 12    {"slay", "wound"}                  0.32
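
One simple way to enumerate minimal term sets from an AND/OR tree is sketched below. It is an illustration only, not the static analysis of [1, 5]: the tree encoding and the name `minimal_term_sets` are assumptions, the tree is the Shooting and Injuries part of the rule base as inferred from Section 4.2, and the sketch assumes each index term occurs in only one branch (otherwise some generated sets may not be minimal).

```python
from itertools import product

# A concept is (operator, [(edge_weight, child), ...]); a leaf is an index term.
WEAPON = ("OR", [(0.5, "gun"), (0.7, "rifle")])
SHOOTING = ("AND", [(1.0, WEAPON), (0.7, "shot")])
INJURIES = ("OR", [(1.0, "dead"), (0.8, "wound")])
KILLING = ("AND", [(0.6, SHOOTING), (0.4, INJURIES)])


def minimal_term_sets(node):
    """Return the term sets that make the concept's RSV larger than 0."""
    if isinstance(node, str):
        return [frozenset([node])]
    operator, children = node
    child_sets = [minimal_term_sets(child) for _, child in children]
    if operator == "OR":
        # Any alternative of any child is sufficient.
        return [s for sets in child_sets for s in sets]
    # AND: pick one alternative per child and merge them.
    return [frozenset().union(*combo) for combo in product(*child_sets)]


for mts in minimal_term_sets(KILLING):
    print(sorted(mts))
```

For this sub-tree the enumeration yields four sets, which correspond to MTS 1 to MTS 4 in Table 1.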


3 Review of Generalized Vector Space Model (GVSM)

In information retrieval, it is common to model index terms and documents as vectors
in a suitably defined vector space [8, 9]. This approach is usually called the Vector
Space Model (VSM). In VSM, all the items of interest to the information retrieval
system are modeled as elements of a vector space. Let $t_1, t_2, \ldots, t_n$ be the terms used to
index the documents in a collection. For each term $t_i$, there is supposed to be a
corresponding vector $\mathbf{t}_i$ in a vector space. The vectors $\{\mathbf{t}_i \mid i = 1, \ldots, n\}$ are considered
as the generating set of the subspace, and therefore all the items of interest can be
represented as linear combinations of the $\mathbf{t}_i$'s.

Let $d_1, d_2, \ldots, d_m$ be the documents in a collection. Then, we can consider each of
those documents as a vector of the form

\[ \mathbf{d}_\alpha = \sum_{i=1}^{n} a_{\alpha i}\, \mathbf{t}_i \tag{3.1} \]

where $a_{\alpha i}$ is the component of $\mathbf{d}_\alpha$ along the direction of the term vector $\mathbf{t}_i$. Similarly, we
can consider a query $q$ as a linear combination of the $\mathbf{t}_j$'s as follows:

\[ \mathbf{q} = \sum_{j=1}^{n} q_j\, \mathbf{t}_j \tag{3.2} \]

where $q_j$ is the component of $\mathbf{q}$ along the direction of the term vector $\mathbf{t}_j$.

Then, the similarity of a document and a query can be obtained by computing
the similarity of the document vector and the query vector in the vector space.
Generally, the cosine similarity function is used as the measure of
similarity; its unnormalized form is the scalar product

\[ \mathbf{d}_\alpha \cdot \mathbf{q} = \sum_{j=1}^{n} \sum_{i=1}^{n} a_{\alpha i}\, q_j\, \mathbf{t}_i \cdot \mathbf{t}_j \tag{3.3} \]

The documents in a collection can be ranked by the Retrieval Status Values
(RSVs) given by the similarity function values between the documents and the query. For
this purpose, we need to know the $a_{\alpha i}$'s and the products $\mathbf{t}_i \cdot \mathbf{t}_j$. Note that we may not
know the vector representation of the $\mathbf{t}_i$'s explicitly. Therefore earlier researchers made the
assumption that $\mathbf{t}_i$ and $\mathbf{t}_j$ are orthogonal if $i \neq j$, that is, $\mathbf{t}_i \cdot \mathbf{t}_j = 0$ if $i \neq j$. Unfortunately,
such an assumption does not hold in the real world. To solve this problem, the
Generalized Vector Space Model (GVSM) was proposed.
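
The effect of the orthogonality assumption on Equation (3.3) can be seen in a small Python sketch. The two-term vocabulary, the weights, and the correlation value 0.5 are made-up illustration data, and the function name `similarity` is an assumption; the matrix `gram[i][j]` stands for the product of term vectors i and j.

```python
def similarity(doc, query, gram):
    """Scalar product of Equation (3.3): sum_j sum_i a_i * q_j * (t_i . t_j)."""
    n = len(doc)
    return sum(doc[i] * query[j] * gram[i][j]
               for i in range(n) for j in range(n))


# Hypothetical data: the document uses only term 0, the query only term 1.
doc = [2.0, 0.0]
query = [0.0, 1.0]

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # t_i . t_j = 0 for i != j
correlated = [[1.0, 0.5], [0.5, 1.0]]   # assumed term-term correlation

print(similarity(doc, query, orthogonal))  # 0.0: no shared term, no match
print(similarity(doc, query, correlated))  # 1.0: the correlation is credited
```

Under orthogonality the document is invisible to the query because they share no term; once the term-term products are nonzero, the same document receives a positive RSV.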

The main idea of GVSM is to map the elements of a
Boolean algebra into a vector space. In this mapping, terms are represented as
linear combinations of vectors associated with atomic expressions (or concepts)
that are pairwise orthogonal. Let $t_1, t_2, \ldots, t_n$ denote the terms that are used to index
the documents in a collection. An atomic expression $m_k$, called a min-term, in the $n$
literals $t_1, t_2, \ldots, t_n$ is a conjunction of the literals where each $t_i$ appears exactly once
and is either in complemented or uncomplemented form. That is, $m_k = x_1$ AND $x_2$ AND
... AND $x_n$, where each $x_i$ is either $t_i$ or $\neg t_i$. Since the number of all possible
min-terms is $2^n$ and the conjunction of any two different min-terms is always zero
(false), we can map the $2^n$ min-terms into the orthogonal basis of the vector space $\mathbb{R}^{2^n}$
as follows:

\[ \mathbf{m}_1 = (1,0,0,\ldots,0), \quad \mathbf{m}_2 = (0,1,0,\ldots,0), \quad \ldots, \quad \mathbf{m}_{2^n} = (0,0,0,\ldots,1) \tag{3.4} \]


On  Modeling  of  Concept  Based  Retrieval  in  Generalized  Vector  Spaces  457 


Since each $t_i$ is itself an element of the Boolean algebra generated, $t_i$ can be
expressed in its disjunctive normal form:

\[ t_i = m_{i1} \text{ OR } m_{i2} \text{ OR } \ldots \text{ OR } m_{ir} \tag{3.5} \]

where the $m_{ij}$'s are those min-terms in which $t_i$ is uncomplemented. Let the set of
min-terms in Equation (3.5) be denoted by $\{m\}^i$. We can now define basis vectors
analogous to Equation (3.4), and term $t_i$ can be written in vector notation as

\[ \mathbf{t}_i = \sum_{k=1}^{2^n} c_{ik}\, \mathbf{m}_k \tag{3.6} \]

where the unnormalized form of $c_{ik}$ is given by

\[ c_{ik} = \sum_{d_\alpha \in D_k} w_{\alpha i} \]

where $D_k$ is the set $\{d_\alpha \mid d_\alpha$ contains all the non-negated index terms in $m_k$ and
excludes all the negated index terms in $m_k\}$. In the above, $w_{\alpha i}$ is the importance of
term $t_i$ in document $d_\alpha$. Wong et al. [11] compute these
weights from the term co-occurrence frequency.


4  Modeling  RUBRIC  Concepts  in  GVSM 

In  RUBRIC,  the  rule-base  tree  describes  certain  relationships  among  concepts. 
Specification  of  a  concept  as  a  rule-base  tree  not  only  helps  the  user  to  describe 
his/her  retrieval  request  more  accurately  and  more  flexibly,  but  also  makes  the 
resulting  rule-base  tree  more  understandable  to  other  users.  However,  in  order  to 
retrieve  a  document,  RUBRIC  requires  that  the  document  contain  all  the  index  terms 
in one MTS. This is a strong requirement for an information retrieval system, even if each
conjunction presents an alternative specification of a query topic. To overcome this
problem, we should consider term correlations.

In GVSM, term correlations are well established based on the co-occurrence
frequency. However, GVSM itself does not provide a way of describing a
query at a conceptual level, while the RUBRIC model does. Consequently, it is
natural to combine these two models. The main issue here is to
construct a query vector in GVSM for a concept defined in a rule-base tree.


4.1  Mapping  MTSs  to  Query  Vectors  in  GVSM 

We now map MTSs into query vectors in GVSM. For this purpose, we
consider the potential weights of index terms with respect to a given concept. The potential weight
expresses the importance of the index term to the given concept, under the assumption
that the index term is used as a term in the query corresponding to the concept. To
compute the potential weight of an index term, we follow the path in the rule-base
tree from the index term up to the root that denotes the given concept and
multiply the weights along the path. For example, in the rule-base tree shown in Fig. 1, the
potential weight of index term "gun" to concept Violent-act is 0.5*1.0*0.6*1.0
= 0.3. Based on the potential weights, we define a weighted MTS for a concept $c$
as an MTS in which each term $t_i$ is assigned the potential weight of $t_i$ to $c$. For
example, the weighted MTS corresponding to MTS 1 in Table 1 is {("gun", 0.3),
("shot", 0.42), ("dead", 0.4)}.
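
The path-product rule can be sketched in Python as follows. The tree encoding and the name `potential_weights` are assumptions made for illustration, the edge weights are those implied by the products quoted in Section 4.2 for the Shooting and Injuries part of the rule base, and the sketch assumes that each index term occurs once in the tree.

```python
# A concept is (operator, [(edge_weight, child), ...]); a leaf is an index term.
WEAPON = ("OR", [(0.5, "gun"), (0.7, "rifle")])
SHOOTING = ("AND", [(1.0, WEAPON), (0.7, "shot")])
INJURIES = ("OR", [(1.0, "dead"), (0.8, "wound")])
KILLING = ("AND", [(0.6, SHOOTING), (0.4, INJURIES)])


def potential_weights(node, path_weight=1.0, weights=None):
    """Multiply edge weights along the path from the root down to each term."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = path_weight
        return weights
    _, children = node
    for edge_weight, child in children:
        potential_weights(child, path_weight * edge_weight, weights)
    return weights


def weighted_mts(mts, weights):
    """Attach the potential weight to every term of a minimal term set."""
    return {term: weights[term] for term in mts}


weights = potential_weights(KILLING)
print(weighted_mts(["gun", "shot", "dead"], weights))
```

Up to floating-point rounding, the printed weights are 0.3, 0.42, and 0.4, matching the weighted MTS given in the text.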

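To complete the mapping of MTSs into query vectors in GVSM, it is necessary
to define vector operations that correspond to the OR and AND operators, respectively. In
this paper, we adopt more general definitions for these operators as follows. Given
$\mathbf{t}_1 = c_{11}\mathbf{m}_1 + c_{12}\mathbf{m}_2 + \ldots + c_{1p}\mathbf{m}_p$ and $\mathbf{t}_2 = c_{21}\mathbf{m}_1 + c_{22}\mathbf{m}_2 + \ldots + c_{2p}\mathbf{m}_p$, where $p = 2^n$,

\[ \mathbf{t}_1 \oplus \mathbf{t}_2 = \sum_{k=1}^{p} \max(c_{1k}, c_{2k})\, \mathbf{m}_k, \qquad \mathbf{t}_1 \otimes \mathbf{t}_2 = \sum_{k=1}^{p} \text{and-max}(c_{1k}, c_{2k})\, \mathbf{m}_k \]

where and-max$(x, y)$ is $\max(x, y)$ if $x$ and $y$ are both nonzero, and zero otherwise. These
operators are analogous to the $\oplus$ and $\otimes$ introduced by Wong et al. [12].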


4.2  A  Complete  Example 

In this subsection, we give an example to show how the extended RUBRIC model
works. Suppose the rule-base tree is as shown in Fig. 2, where Shooting and Injuries
are the same concepts as in Fig. 1. The weighted MTSs for concept
Killing are:

{"gun" (0.5*1.0*0.6 = 0.3), "shot" (0.7*0.6 = 0.42), "dead" (1.0*0.4 = 0.4)}
{"gun" (0.5*1.0*0.6 = 0.3), "shot" (0.7*0.6 = 0.42), "wound" (0.8*0.4 = 0.32)}
{"rifle" (0.7*1.0*0.6 = 0.42), "shot" (0.7*0.6 = 0.42), "dead" (1.0*0.4 = 0.4)}
{"rifle" (0.7*1.0*0.6 = 0.42), "shot" (0.7*0.6 = 0.42), "wound" (0.8*0.4 = 0.32)}


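[Figure 2: rule-base tree rooted at Killing, with component concepts Shooting
and Injuries. The only surviving edge weight is 0.4.]
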

Fig.  2.  An  example  rule-base  tree 


Suppose the document-term matrix and the basis vectors are as shown in Table 2 and
Table 3, respectively.


Table 2. Document-term matrix

       wound   dead   shot   rifle   gun
d1     2       0      0      1       3
d2     1       2      2      0       0
d3     4       0      0      1       1
d4     0       2      3      4       0
d5     1       2      2      2       4
d6     2       1      4      0       1

Table 3. Basis vectors

         wound   dead   shot   rifle   gun   basis vector
m_0      0       0      0      0       0     m_0
m_1      0       0      0      0       1     m_1
...
m_31     1       1      1      1       1     m_31

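Term vectors can be calculated as:

\[
\begin{aligned}
\textbf{wound} &= 0.92\,\mathbf{m}_{19} + 0.15\,\mathbf{m}_{28} + 0.30\,\mathbf{m}_{29} + 0.15\,\mathbf{m}_{31} \\
\textbf{dead} &= 0.55\,\mathbf{m}_{14} + 0.55\,\mathbf{m}_{28} + 0.27\,\mathbf{m}_{29} + 0.55\,\mathbf{m}_{31} \\
\textbf{shot} &= 0.52\,\mathbf{m}_{14} + 0.34\,\mathbf{m}_{28} + 0.69\,\mathbf{m}_{29} + 0.34\,\mathbf{m}_{31} \\
\textbf{rifle} &= 0.81\,\mathbf{m}_{14} + 0.40\,\mathbf{m}_{19} + 0.40\,\mathbf{m}_{31} \\
\textbf{gun} &= 0.69\,\mathbf{m}_{19} + 0.17\,\mathbf{m}_{29} + 0.69\,\mathbf{m}_{31}
\end{aligned}
\]

Using Equation (3.6), we can compute the term vectors. For example, we can
get the coefficient of $\mathbf{m}_{19}$ in term vector wound as follows. Since the Boolean pattern
corresponding to $m_{19}$ is 10011, only documents $d_1$ and $d_3$ match this pattern.
Therefore, the unnormalized coefficient of $\mathbf{m}_{19}$ is 2 + 4 = 6. Similarly, the
unnormalized coefficients of $\mathbf{m}_{28}$, $\mathbf{m}_{29}$, and $\mathbf{m}_{31}$ are 1, 2, and 1, respectively. Now, we
normalize the coefficient of $\mathbf{m}_{19}$ as $6 / \sqrt{6^2 + 1^2 + 2^2 + 1^2} = 0.92$. Using these term
vectors, we can compute document vectors. For example, since $\mathbf{d}_1$ = 2 wound + 1
rifle + 3 gun from the document-term matrix in Table 2, $\mathbf{d}_1 = 0.81\,\mathbf{m}_{14} + 4.31\,\mathbf{m}_{19} +
0.30\,\mathbf{m}_{28} + 1.11\,\mathbf{m}_{29} + 2.77\,\mathbf{m}_{31}$.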
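
The computation of term vectors and document vectors can be sketched in Python as follows. The function names and the dictionary representation (min-term index mapped to coefficient) are assumptions made for illustration, and the matrix is Table 2 with its blank cells read as 0, which is an assumption as well. With that reading the sketch reproduces the coefficients quoted above, except that the text appears to truncate to two decimals (for example 0.9258 is printed as 0.92), so the document vector differs slightly in the second decimal.

```python
from math import sqrt

TERMS = ["wound", "dead", "shot", "rifle", "gun"]
DOC_TERM = {                      # Table 2, blank cells assumed to be 0
    "d1": [2, 0, 0, 1, 3],
    "d2": [1, 2, 2, 0, 0],
    "d3": [4, 0, 0, 1, 1],
    "d4": [0, 2, 3, 4, 0],
    "d5": [1, 2, 2, 2, 4],
    "d6": [2, 1, 4, 0, 1],
}


def minterm_index(row):
    """Read the Boolean pattern of a document as a binary number (10011 -> 19)."""
    index = 0
    for weight in row:
        index = 2 * index + (1 if weight > 0 else 0)
    return index


def term_vectors(doc_term, terms):
    """Sum term weights per min-term, then normalize each vector to unit length."""
    raw = {term: {} for term in terms}
    for row in doc_term.values():
        k = minterm_index(row)
        for term, weight in zip(terms, row):
            if weight > 0:
                raw[term][k] = raw[term].get(k, 0) + weight
    vectors = {}
    for term, coeffs in raw.items():
        norm = sqrt(sum(c * c for c in coeffs.values()))
        vectors[term] = {k: c / norm for k, c in coeffs.items()} if norm else {}
    return vectors


def document_vector(row, terms, vectors):
    """Linear combination of term vectors weighted by the document's term weights."""
    doc = {}
    for term, weight in zip(terms, row):
        if weight == 0:
            continue
        for k, c in vectors[term].items():
            doc[k] = doc.get(k, 0.0) + weight * c
    return doc


vectors = term_vectors(DOC_TERM, TERMS)
d1 = document_vector(DOC_TERM["d1"], TERMS, vectors)
print({k: round(c, 2) for k, c in sorted(vectors["wound"].items())})
print({k: round(c, 2) for k, c in sorted(d1.items())})
```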

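For expressing queries, we use the operators $\oplus$ and $\otimes$ introduced in Section 4.1.
For concept Killing, if we choose MTS1 = {(gun, 0.3), (shot, 0.42), (dead, 0.4)} as a
query, then we can construct the query vector as follows.

\[
\begin{aligned}
\mathbf{q} &= [0.3\,(0.69\,\mathbf{m}_{19} + 0.17\,\mathbf{m}_{29} + 0.69\,\mathbf{m}_{31})] \otimes \\
&\quad [0.42\,(0.52\,\mathbf{m}_{14} + 0.34\,\mathbf{m}_{28} + 0.69\,\mathbf{m}_{29} + 0.34\,\mathbf{m}_{31})] \otimes \\
&\quad [0.4\,(0.55\,\mathbf{m}_{14} + 0.55\,\mathbf{m}_{28} + 0.27\,\mathbf{m}_{29} + 0.55\,\mathbf{m}_{31})] \\
&= 0.29\,\mathbf{m}_{29} + 0.22\,\mathbf{m}_{31}
\end{aligned}
\]

For concept Killing, if we choose the following two MTSs as a query:

MTS1 = {(gun, 0.3), (shot, 0.42), (dead, 0.4)},

MTS2 = {(rifle, 0.42), (shot, 0.42), (dead, 0.4)}

then the corresponding vector $\mathbf{q}$ is computed as $\mathbf{q} = \mathbf{q}_1 \oplus \mathbf{q}_2$, where $\mathbf{q}_1$ and
$\mathbf{q}_2$ are the query vectors for MTS1 and MTS2, respectively.

\[
\begin{aligned}
\mathbf{q}_1 &= 0.29\,\mathbf{m}_{29} + 0.22\,\mathbf{m}_{31} \\
\mathbf{q}_2 &= [0.42\,(0.81\,\mathbf{m}_{14} + 0.40\,\mathbf{m}_{19} + 0.40\,\mathbf{m}_{31})] \otimes \\
&\quad [0.42\,(0.52\,\mathbf{m}_{14} + 0.34\,\mathbf{m}_{28} + 0.69\,\mathbf{m}_{29} + 0.34\,\mathbf{m}_{31})] \otimes \\
&\quad [0.4\,(0.55\,\mathbf{m}_{14} + 0.55\,\mathbf{m}_{28} + 0.27\,\mathbf{m}_{29} + 0.55\,\mathbf{m}_{31})] \\
&= 0.34\,\mathbf{m}_{14} + 0.22\,\mathbf{m}_{31} \\
\mathbf{q}_1 \oplus \mathbf{q}_2 &= 0.34\,\mathbf{m}_{14} + 0.29\,\mathbf{m}_{29} + 0.22\,\mathbf{m}_{31}
\end{aligned}
\]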
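
The query construction above can be checked with a short Python sketch. The function names (`scale`, `and_max`, `or_max`, `query_vector`) and the sparse dictionary representation are assumptions made for illustration; the term-vector coefficients are the two-decimal values quoted in the text. A coefficient that is absent from a dictionary is treated as zero, which is what makes the and-max operator drop every min-term not shared by all operands.

```python
from functools import reduce

# Term vectors as {min-term index: coefficient}, values copied from the text.
GUN = {19: 0.69, 29: 0.17, 31: 0.69}
RIFLE = {14: 0.81, 19: 0.40, 31: 0.40}
SHOT = {14: 0.52, 28: 0.34, 29: 0.69, 31: 0.34}
DEAD = {14: 0.55, 28: 0.55, 29: 0.27, 31: 0.55}


def scale(weight, vector):
    """Multiply a term vector by the potential weight of its term."""
    return {k: weight * c for k, c in vector.items()}


def and_max(u, v):
    """Conjunction: max where both coefficients are nonzero, zero elsewhere."""
    return {k: max(u[k], v[k]) for k in u.keys() & v.keys()
            if u[k] != 0 and v[k] != 0}


def or_max(u, v):
    """Disjunction: coefficient-wise max over all min-terms."""
    return {k: max(u.get(k, 0.0), v.get(k, 0.0)) for k in u.keys() | v.keys()}


def query_vector(weighted_terms):
    """Combine the scaled term vectors of one weighted MTS with and_max."""
    return reduce(and_max, (scale(w, vec) for vec, w in weighted_terms))


q1 = query_vector([(GUN, 0.3), (SHOT, 0.42), (DEAD, 0.4)])
q2 = query_vector([(RIFLE, 0.42), (SHOT, 0.42), (DEAD, 0.4)])
q = or_max(q1, q2)
print({k: round(c, 2) for k, c in sorted(q.items())})
```

Rounded to two decimals, the result agrees with the combined query vector given above: 0.34 on min-term 14, 0.29 on min-term 29, and 0.22 on min-term 31.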


5 Experimental Design and Results


For the experimental test, we chose the Google search engine [13]. It uses a
conjunction of keywords as a query and returns a set of links to web pages. We use
the rule-base tree shown in Fig. 1 for our experimental test. From Fig. 1, we can
compute 12 weighted MTSs and construct a query for each MTS. For each query, we
chose the top 20 links retrieved by Google. Since some of the linked web pages
had been changed or moved after being indexed by Google, we eliminated those
links. In the end, we obtained a collection of 196 documents. The relevance judgments, for
evaluation purposes, were determined by looking through each document and choosing
those documents related to the concept Violent-act. From the collected documents, we
can construct the document-term matrix. Using the matrix, we can also compute the
term vectors and document vectors. Since an RSV (similarity
between a query and a document) is generated for each document with respect to each MTS,
further processing is done to combine them into a single RSV for each document
satisfying more than one MTS. For this purpose, the user can be given the option of
performing the disjunction operation over some or all of the MTSs. Thus, the ranking
done by our system requires post-processing of the results from Google.

In our experiment, we select all the MTSs and compute the performance in terms
of recall-precision. We conduct experiments using two different disjunction operators,
namely $+$ and $\oplus$. We call the former Extended Rubric version 1 (ER1) and the latter
Extended Rubric version 2 (ER2).


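Table 4. Recall-Precision using the original RUBRIC approaches (R1 and R2) and the
extended RUBRIC approaches (ER1 and ER2).

            Precision
Recall      R1        R2        ER1           ER2
0.1         0.7143    0.5882    0.6250        0.7143
0.2         0.6250    0.5882    [illegible]   0.6667
0.3 - 0.5   [rows illegible; surviving values in source order: 0.5686, 0.5672, 0.5672, 0.5875, 0.6528]
0.6         [incomplete; surviving values in source order: 0.6364, 0.6154]
0.7         [incomplete; surviving values in source order: 0.6019, 0.5328, 0.5462]
0.8         [incomplete; surviving values in source order: 0.5362, 0.5175, 0.5175]
0.9         [incomplete; surviving values in source order: 0.5030, 0.5390, 0.5155]
1.0         0.4742    0.4742    0.4769        0.4792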

In order to compare the extended versions with the original RUBRIC approaches,
we conduct the experiment using two versions of the original RUBRIC, denoted by R1
and R2, respectively. In R1, we adopt the original idea of RUBRIC. That is, we use the
max operator and the min operator for disjunction and conjunction, respectively. In R2,
we use the same operator (that is, the min operator) for conjunction but replace the
max operator for disjunction with the weighted operator of [5].
Suppose that document $d$ is satisfied by three weighted MTSs $M_1$, $M_2$,
and $M_3$, that weights $w_1$, $w_2$, and $w_3$ are assigned to $M_1$, $M_2$, and $M_3$,
respectively, and that $w_2$ is the maximum value among $w_1$, $w_2$, and $w_3$. Then, we compute
the RSV of $d$ in the following weighted manner: RSV $= w_1 + 2w_2 + w_3$. The
performance of ER1, ER2, R1, and R2 is given in Table 4.
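
The two disjunction rules for the original RUBRIC variants can be sketched as follows. The function names are assumptions, the example weights are made up, and the generalization of the R2 rule from three MTSs to any number (sum of all weights plus the maximum once more) is an extrapolation from the three-MTS formula above, not something the text states.

```python
def rsv_r1(weights):
    """R1: max operator over the weights of the MTSs a document satisfies."""
    return max(weights)


def rsv_r2(weights):
    """R2: every weight counts once and the largest weight counts twice.

    For three weights with w2 the maximum this is w1 + 2*w2 + w3.
    """
    return sum(weights) + max(weights)


# Hypothetical weights of three MTSs satisfied by one document.
satisfied = [0.3, 0.4, 0.32]
print(rsv_r1(satisfied))
print(rsv_r2(satisfied))
```

Unlike R1, the R2 score grows with the number of satisfied MTSs, so a document that matches several alternative specifications is ranked above one that matches only the strongest of them.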


6  Conclusions 

From the experiments, we reach the following conclusions about the method of
evaluating documents relative to a rule base.

(1) The variation of RUBRIC (R2) does not improve the accuracy of RSVs.

(2) The extended approach ER1 shows performance similar to that of the original
RUBRIC approach R1.

(3) ER2 shows better performance than ER1.

Our approach can be extended in several ways. First of all, the term weight
need not be frequency information. For example, some existing search engines
provide weights or scores for retrieved documents. In that case, if a single term is sent
as the query to the search engine, the weight or ranking assigned to a document can be
treated as the weight of the term in the document. Other interesting issues related to
the RUBRIC approach are currently being investigated. For example, based on
relevance feedback, we can adjust the weights both in the rule-base tree and in the
document-term matrix [3]. Furthermore, with the help of some knowledge bases, the
user can construct the rule-base tree more automatically [4].


Abstract. It is common that a text document contains information that
can be interpreted as instructions to pursue a given task. This information,
called a pattern, can be seen as the triggering mechanism for a set of
predefined operations. We are interested in automating the recognition of
these patterns for repetitive tasks. We introduce the notion of template
generation, which allows for the recognition of new patterns that trigger
operations. We implemented an algorithm for template generation and
tested it in an electronic publishing application. The tests show that
some characteristics of the processed text can be used to adapt the
generation process and obtain templates that provide better precision and
recall.


1  Introduction 

Today, most operations on text, such as classification or editing, are still done
manually. Some of these operations are repeatedly performed on a daily basis, e.g.
in e-publishing or document management. In these cases, it is common that a text
document contains information that can be interpreted as instructions to pursue
a given task. This information can be seen as the triggering mechanism for a set of
predefined operations. We use the common name of pattern for this information.
A pattern can appear in different forms such as a word, phrase, sentence, etc.
The following are examples of repetitive tasks on text which contain implicit or
explicit instructions:

- Establishing an interaction in natural language with a computer. For example,
  a command like "turn on the light in the kitchen" is a written
  instruction that could be given on a regular basis to an intelligent device in
  charge of the environmental control of a house.

- Classifying/routing documents. For example, an electronic message that
  contains the pattern "Mr Hill wants to change the carburetor of his
  car" is forwarded to the team in charge of engine repair, while the message
  which contains the pattern "Mrs Jones wants to change the door
  of her van" is forwarded to the team in charge of chassis repair.

- Transforming textual information into data. For example, the string "let's
  meet at 3 pm on Wednesday in my office" written in a message can be
  an implicit instruction to create and insert a meeting appointment in a
  calendar.

Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI 1932,  pp.  463-473,  2000. 
- Performing an editing operation to fit a policy. For example, in an electronic
  journal, the references to articles like "in the IJCAI paper of Harris"
  will be transformed into a link to the actual article.

Typically, the properties shared by these repetitive tasks are:

- No initial corpus is provided for training. For example, to produce a training
  corpus, the editor of a journal would need to accumulate hundreds of articles
  and thus would produce tens of issues before getting any automatic support.

- The set of patterns to be identified can change. This can be due to the
  modification of the users' habits or needs, to the modification of the way
  instructions are expressed in the texts, etc.

- Some background knowledge is needed to perform the tasks. The background
  knowledge can be domain-specific concepts or information on the structure
  of the texts¹. For example, the editor of ETAI [10], an electronic journal,
  uses a knowledge base of authors and references to articles from which an
  ontology can be built. In this paper we assume the existence of a basic set
  of concepts for each task.

Automating the operations on text by establishing a human-computer
collaboration would save much human effort and material resources. Our work aims
at automating the generation of templates which represent and are used for the
identification of patterns that trigger the same operation. The means to
recognize patterns range from the recognition of syntactic characteristics of the
patterns (e.g. keywords, grammar rules) to the recognition of their semantical
characteristics, which implies reasoning about the meaning of words and their
context (e.g. conceptualization, sentence understanding). In this paper we
introduce a method for template generation to support the semantical identification
of textual patterns in repetitive tasks. Template generation is performed after
each execution of a given task to incrementally build the set of templates that
identify all patterns triggering an operation.

Given  a  pattern  p,  template  generation,  TG,  creates  a  template  t  by  substituting 
specific  information  in  p  with  less  specific  information  (i.e.  concepts).  The  substi¬ 
tutions  are  made  through  a  number  of  rewritings  according  to  the  syntactic  and 
semantical  rules  of  a  grammar  G.  A  concept  describes  a  characteristic  of  a  piece 
of  pattern  and  can  be  of  different  categories  like  semantic,  grammatical  role,  lan¬ 
guage,  etc.  Thus,  given  the  (semantic)  concepts  “<action>”,  “<device>”  and 
“<room>”,  and  the  pattern  p  =  “turn  on  the  light  in  the  kitchen”,  TG 
can  rewrite  p  into  t  =  “<action>  the  <device>  in  the  <room>” .  Notice  that 
it  is  possible  to  generate  several  templates  from  a  given  pattern.  For  example,  TG 
can  also  create  the  template  t'  =  “<action>  the  light  in  the  <room>”  from 
p.  Given  a  pattern  p  which  triggers  the  operation  o,  we  denote  by  Tp  =  {fi, ...,  tn} 
the  set  of  templates  that  can  be  created  by  TG,  where  each  template  identifies 
a  different  set  of  patterns  =  p£  U  PtT.  The  elements  of  P£  trigger  o  while 


1  Knowledge  acquisition  techniques  such  as  data  mining  or  interview  of  experts  can 
support  the  construction  of  domain  specific  knowledge. 
the elements of $P_t^-$ do not trigger o. In this paper we consider the case where
p is rewritten into a unique template t such that (1) t identifies only patterns
which trigger o (i.e. $P_t = P_t^+$)², (2) t is more general than or equal to the template
that identifies only p (i.e. $t \in T_p$), and (3) t is one of the most general templates
that satisfy (1) and (2).
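
As a minimal sketch of the rewriting described above, the following Python fragment replaces pieces of a pattern with concepts and enumerates the resulting templates. The lexicon `CONCEPT_OF` and the function `generate_templates` are hypothetical names introduced for illustration. The paper performs the substitutions through the rewrite rules of a grammar G, which this sketch does not model.

```python
from itertools import product

# Hypothetical lexicon: phrase -> concept (illustration only, not from the paper).
CONCEPT_OF = {
    "turn on": "<action>",
    "light": "<device>",
    "kitchen": "<room>",
}

def generate_templates(pattern, concept_of=CONCEPT_OF):
    """Return every template obtained by replacing any subset of the
    known phrases in `pattern` with their concepts."""
    phrases = [ph for ph in concept_of if ph in pattern]
    templates = set()
    # Each phrase is either kept (False) or replaced by its concept (True).
    for choice in product([False, True], repeat=len(phrases)):
        t = pattern
        for ph, use_concept in zip(phrases, choice):
            if use_concept:
                t = t.replace(ph, concept_of[ph])
        templates.add(t)
    return templates

templates = generate_templates("turn on the light in the kitchen")
```

With three replaceable phrases the set contains eight templates, including the pattern itself and both templates t and t' quoted in the text. Choosing among them is the problem that conditions (1) to (3) address.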

In  section  2  we  describe  the  characteristics  of  the  ontology  of  concepts  usable 
by  TG.  In  section  3,  we  introduce  our  solution  for  template  generation:  pattern 
generalization (PG). In section 4 we give the task scenario that we use to evaluate
our method. In section 5 we describe our test results. In section 6 we relate
template  generation  to  other  work.  The  paper  concludes  in  section  7. 


2  An  Ontology  for  Template  Generation 

TG  is  based  on  the  possibility  to  capture  the  semantics  of  a  pattern.  This  is 
done by substituting pieces of the pattern with relevant concepts. For example,
given the pattern p = “the article published in IJCAI-99”, and the
concept <conference>, “IJCAI-99” can be substituted by <conference> to
produce the template t = “the article published in <conference>”. It is
then possible to recognize all the patterns syntactically similar to p which express
a reference to a paper published in any conference. If some of these patterns do
not trigger the creation of a link, the concept may be too general and has to be
replaced by a more specific one (e.g. <AI-conference>). Such substitutions are
possible  if  there  is  an  ontology  which  describes  the  concepts  expressed  in  the 
processed  texts.  In  [5]  Guarino  and  Giaretta  point  out  that  the  word  “ontology” 
is  ambiguous.  We  adopt  the  definition  commonly  used  in  knowledge  engineering 
[1]:  an  ontology  is  a  set  of  specifications  of  conceptualizations,  where  a  conceptu¬ 
alization  is  a  set  of  definitions  of  elements  (concepts)  in  a  domain.  So,  to  describe 
an  ontology,  we  should  provide:  (1)  the  descriptions  of  the  concepts,  and  (2)  the 
relations  between  the  concepts  (examples  of  common  relations  are  is-a,  part-of, 
consist-of,  etc).  The  concepts  and  the  relations  aim  at  describing  the  piece  of  the 
real  world  that  a  given  application  needs  to  know.  The  is-a  relation  is  impor¬ 
tant  because  it  allows  comparing  the  concepts  in  terms  of  generality.  When  it 
comes  to  providing  definitions  of  concepts,  we  notice  (1)  the  existence  of  distinct 
categories  of  concepts  to  characterize  pieces  of  text  (i.e.  semantic,  grammatical 
role,  presentation,  language,  etc.)  and  (2)  the  possibility  to  describe  concepts 
in  several  ways  (i.e.  natural  language  description,  value  enumeration,  grammar, 
process,  etc.).  The  existence  of  several  categories  of  concepts  augments  the  num¬ 
ber  of  potential  substitutions  for  a  given  piece  of  pattern  and  complicates  the 
process  of  the  creation  of  a  template.  Moreover,  to  be  able  to  perform  the  sub¬ 
stitution  of  a  piece  of  pattern  by  a  concept,  TG  should  be  able  to  understand 
the  different  means  of  descriptions  used  by  the  ontology.  This  problem  is  outside 
the  scope  of  this  paper.  Figure  1  gives  an  example  of  ontology  usable  by  TG. 
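
A minimal sketch of how the is-a relation supports comparisons of generality is given here. The concept names are those of Figure 1; the dictionary encoding and the function names `ancestors`, `more_general` and `category` are assumptions of this sketch, not part of the paper.

```python
# Hypothetical encoding of the is-a relation: child concept -> parent concept.
IS_A = {
    "<part-of-speech>": "<grammatical_concept>",
    "<conference>": "<semantical_concept>",
    "<author>": "<semantical_concept>",
    "<editor>": "<semantical_concept>",
    "<journal>": "<semantical_concept>",
    "<AI-conference>": "<conference>",
    "<AI-author>": "<author>",
}

def ancestors(concept, is_a=IS_A):
    """Return the chain of concepts reachable from `concept` through is-a."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain

def more_general(c1, c2, is_a=IS_A):
    """True if c1 is strictly more general than c2 (c2 is-a ... is-a c1)."""
    return c1 in ancestors(c2, is_a)

def category(concept, is_a=IS_A):
    """Return the root of the hierarchy that contains `concept`."""
    chain = ancestors(concept, is_a)
    return chain[-1] if chain else concept
```

For instance, `more_general("<conference>", "<AI-conference>")` holds, while `<author>` and `<conference>` are not comparable because neither is an ancestor of the other.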

2  There  is  always  a  solution  since  the  template  that  only  identifies  the  pattern  p  fulfills 
the  requirements. 
<grammatical_concept>      <semantical_concept>
         |                 /        |        |        \
<part-of-speech>   <conference>  <author>  <editor>  <journal>
                        |           |
                <AI-conference>  <AI-author>

A line from a concept c1 down to a concept c2 means: c2 is-a c1


Fig. 1. An example of ontology for a task of journal editing


3  The  Process  of  Pattern  Generalization 

In  this  section,  we  describe  the  process  of  pattern  generalization  (i.e.  PG),  our 
solution  for  template  generation.  PG  consists  of  rewriting  pieces  of  a  pattern  p 
into  concepts  according  to  the  rewrite  rules  of  a  grammar. 

—  Tp  is  the  set  of  templates  that  can  be  created  from  a  pattern  p. 

— We define the order $>_g$ on $T_p$ such that:

$\forall t_i\, \forall t_j : P_{t_i} \supseteq P_{t_j} \wedge P_{t_i} \neq P_{t_j} \Rightarrow t_i >_g t_j$.

If $t_i >_g t_j$ we say that $t_i$ is more general than $t_j$ and that $t_j$ is more specific
than $t_i$.

— The process of PG is the function PG(C, Txt, p, G) return t where

•  C  is  the  ontology  (cf.  section  2). 

•  Txt  is  the  set  of  texts  already  processed  where  the  pieces  of  text  that 
are  not  patterns  are  annotated. 

•  p  is  a  pattern  which  triggers  an  operation  o. 

• t is a most general template of $T_p$ that can be used to recognize only
patterns that trigger o: $P_t = P_t^+$.

• G is a grammar (cf. table 1). A pattern is a string which forms either
a sentence (S_struct) or a phrase (P)³, and a template is a sequence of
strings and concepts. G is composed of three kinds of rules:

1. “SubPart -> Concept” defines the non-terminal SubPart as a piece
of pattern that can be substituted by a concept.

2. “X -> X_struct | SubPart” expresses the possibility of transforming
a piece of pattern X into either a “legal English structure” of X or a
SubPart.

3. “X_struct -> ... | ...” describes the possible “legal English structures”
of X.⁴

The  grammar  is  ambiguous  at  two  levels:  (1)  it  is  possible  to  create 
several  parse  trees  (i.e.  chains  of  rewrite  rules)  for  a  given  pattern  p,  and 
(2)  it  is  possible  to  associate  several  concepts  to  the  same  SubPart. 
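
The order $>_g$ defined above can be sketched as a strict-superset test on the sets of patterns that two templates identify. The function names and the two pattern sets are hypothetical and serve only as an illustration; in particular, the second pattern is invented for the example.

```python
def more_general_template(p_ti, p_tj):
    """t_i >g t_j holds when P_ti contains P_tj and the two sets differ."""
    return p_ti >= p_tj and p_ti != p_tj

def most_general(templates):
    """Return the templates that no other template is more general than.

    `templates` maps a template to the set of patterns it identifies."""
    return [
        t for t, pats in templates.items()
        if not any(more_general_template(other, pats)
                   for other in templates.values())
    ]

# Hypothetical pattern sets, for illustration only.
P = {
    "in the <conference> paper of <author>": {
        "in the IJCAI paper of Harris",
        "in the ECAI paper of Smith",
    },
    "in the IJCAI paper of <author>": {
        "in the IJCAI paper of Harris",
    },
}
```

Because the order is partial, `most_general` may return several templates when their pattern sets are not comparable.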


3  In  the  grammar  we  restrict  the  patterns  to  the  forms  of  “legal  English  structures”  of 
sentences  or  phrases  to  allow  the  usage  of  NLP  analyzers  which  require  grammatically 
correct  input.  The  principle  of  PG  can  handle  any  piece  of  text. 

4  These  rules  can  be  completed  to  parse  more  legal  English  structures  of  sentences, 
phrases,  noun  phrases,  and  verb  phrases. 
S_struct   -> P | P PREP P | NP VP | ...
P          -> P_struct | SubPart
P_struct   -> NP | VP | ...
NP         -> NP_struct | SubPart
NP_struct  -> N | ADJ N | ART ADJ N | ...
VP         -> VP_struct | SubPart
VP_struct  -> V | V NP | ...
N          -> N_struct | SubPart
V          -> V_struct | SubPart
ADJ        -> ADJ_struct | SubPart
ART        -> ART_struct | SubPart
SubPart    -> Concept

S_struct: sentence, P: phrase, PREP: preposition, ADJ: adjective,
ART: article, N: noun, V: verb.
Concept is a terminal. Its instances belong to C.
N_struct, V_struct, ADJ_struct, ART_struct are terminals.
Their values are given by a lexicon.


Table 1. Ambiguous grammar for rewriting a pattern into a template


The  process  of  computing  PG()  is  composed  of  the  two  following  steps: 

1. BuildGeneralTemplate(p, C) return $t_g$,

where $t_g$ is such that: $(t_g \in T_p) \wedge (\forall t_i \in T_p : \neg(t_i >_g t_g))$.

Building $t_g$ implies the choice of the derivation which rewrites p into a most
general template. Table 2 gives the rules to make this choice⁵.

2. RefineTemplate($t_g$, C, Txt) return t such that $P_t = P_t^+$.

RefineTemplate() = {For each $p^- \in P_{t_g}^-$ Do Refine($t_g$, p, $p^-$, C)}.
Refine() applies on $t_g$ the most specific derivation which provides the most
general template that does not match $p^-$. It is done in three steps.

2.1 DifferentSubparts($t_g$, p, $p^-$) return S, where S is the set of subparts
of $t_g$ such that $\forall s \in S$ the form of s is a concept in $t_g$, and the forms of s in p
and in $p^-$ are different⁶. E.g. s = “<author>” in $t_g$, s = “Harris” in p and s =
“Smith” in $p^-$.

2.2 OrderedDerivations(S, C) return D.

$\forall s \in S$, s is a concept and there are two kinds of derivations possible to make
s more specific: (d1) $s \Rightarrow concept$ where concept is-a s and,
(d2) $s \Rightarrow X \Rightarrow X\_struct \Rightarrow s' \Rightarrow g$ where g is the most general form of s'.

Table 3 gives the rules to choose the most specific derivations for a given subpart
(i.e. a concept). D is the set of combinations of derivations on s where the
derivations are stored from more general to more specific.

2.3 Enumerate($t_g$, S, D, C, $p^-$) return t = {For each d in D Do
t ← ApplyDerivation($t_g$, d); If $p^- \notin P_t$ then return t}
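
The refinement step (steps 2.1 to 2.3) can be sketched as follows, under strong simplifying assumptions: a template is a list of tokens, only derivations of kind (d1) are tried before falling back to the literal word, and the ontology `IS_A` and the lexicon `INSTANCES` are hypothetical. The function names follow the steps above but the bodies are an illustration, not the authors' implementation.

```python
# Hypothetical ontology and lexicon (illustration only).
IS_A = {"<AI-author>": "<author>"}          # child concept -> parent concept
INSTANCES = {
    "<author>": {"Harris", "Smith"},
    "<AI-author>": {"Harris"},
}

def matches(template, words):
    """A template matches a word list when every literal token equals the
    word and every concept token has the word among its instances."""
    if len(template) != len(words):
        return False
    return all(tok == w or w in INSTANCES.get(tok, ())
               for tok, w in zip(template, words))

def different_subparts(tg, p, p_neg):
    """Step 2.1: positions where tg holds a concept and p, p_neg differ."""
    return [i for i, tok in enumerate(tg)
            if tok in INSTANCES and p[i] != p_neg[i]]

def ordered_derivations(concept, word):
    """Step 2.2 (simplified): more specific replacements for `concept`,
    from more general to more specific, ending with the literal word."""
    children = [c for c, parent in IS_A.items()
                if parent == concept and word in INSTANCES.get(c, ())]
    return children + [word]

def refine(tg, p, p_neg):
    """Step 2.3: return the first refinement of tg that rejects p_neg."""
    if not matches(tg, p_neg):
        return tg
    for i in different_subparts(tg, p, p_neg):
        for replacement in ordered_derivations(tg[i], p[i]):
            t = tg[:i] + [replacement] + tg[i + 1:]
            if not matches(t, p_neg):
                return t
    return list(p)  # fallback: the template that identifies only p

tg = ["the", "results", "by", "<author>"]
p = "the results by Harris".split()
p_neg = "the results by Smith".split()
refined = refine(tg, p, p_neg)
```

Here the concept `<author>` is replaced by the more specific `<AI-author>`, which still matches p but no longer matches the negative pattern.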


5  Currently,  for  the  cases  1  and  3  of  rule  2,  we  use  the  first  heuristic.  More  investigations 
need  to  be  done  on  the  second  heuristic. 

6 This selection limits the space of the derivations to those which focus on the
elimination of $p^-$ from $P_{t_g}$.
1. X -> X_struct | SubPart

Heuristic: the conceptualization of a subpart is more general than the sequence
of conceptualizations of its subparts. Thus,

$(p \Rightarrow A\ X\ B \Rightarrow A\ \text{SubPart}\ B \Rightarrow t_1) \wedge (p \Rightarrow A\ X\ B \Rightarrow A\ X\_struct\ B \Rightarrow t_2) \Rightarrow t_1 >_g t_2$.

E.g. given X = “the Oxford University Press”,

$(p \Rightarrow A\ X\ B \Rightarrow A\ \text{SubPart}\ B \Rightarrow A\ \text{<editor>}\ B \Rightarrow t_1)\ \wedge$

$(p \Rightarrow A\ X\ B \Rightarrow A\ X\_struct\ B \Rightarrow A\ \text{“the <university> Press”}\ B \Rightarrow t_2)$

$\Rightarrow t_1 >_g t_2$.

2. SubPart -> Concept

Given the set $C_s = \{c_1, c_2, \ldots, c_n\}$ of concepts associated to the subpart s, the
rewrite rule becomes s -> $c_1$ | $c_2$ | ... | $c_n$.

Case 1: some concepts belong to different categories of s.

E.g. s = “Harris” $\wedge$ $c_1$ = <author> $\wedge$ $c_2$ = <personal-noun> $\wedge$
$c_1$ is-a <semantic_concept> $\wedge$ $c_2$ is-a <grammatical_concept>.

Two heuristics: (1) Defining an order among the categories (e.g.
<semantic_concept> $>_g$ <grammatical_concept>), and (2) Considering all
possible orders and generating a template for each.

Case 2: there is an is-a relation between some concepts.

$\exists (c_i, c_j) \in C^2 : (c_i \text{ is-a } c_j) \wedge (p \Rightarrow A\ \text{SubPart}\ B \Rightarrow A\ c_i\ B \Rightarrow t_1) \wedge (p \Rightarrow A\ \text{SubPart}\ B \Rightarrow A\ c_j\ B \Rightarrow t_2) \Rightarrow (t_2 >_g t_1)$.

E.g. s = “Harris” $\wedge$ $c_1$ = <AI-author> $\wedge$ $c_2$ = <author> $\wedge$

(<AI-author> is-a <author>) $\wedge$

$(p \Rightarrow A\ s\ B \Rightarrow A\ \text{<AI-author>}\ B \Rightarrow t_1)\ \wedge$

$(p \Rightarrow A\ s\ B \Rightarrow A\ \text{<author>}\ B \Rightarrow t_2) \Rightarrow (t_2 >_g t_1)$.

Case 3: some concepts belong to the same category and there is no is-a relation
between them. E.g. s = “Brown”, $c_1$ = <author> and $c_2$ = <university> $\wedge$
$c_1$ is-a <semantic_concept> $\wedge$ $c_2$ is-a <semantic_concept>.

There is only one concept which is true.

Two heuristics: (1) Picking one concept randomly to generate t, and (2) Generating
one template for each concept.

Table 2. Rules to choose a most general derivation


1. $\forall c \in C : \text{ApplyDerivation}(c, d2) >_g \text{ApplyDerivation}(c, d1)$.

2. $\forall (c_i, c_j) \in C^2 : c_i \text{ is-a } c_j \Rightarrow \text{ApplyDerivation}(c, d1(c_j)) >_g \text{ApplyDerivation}(c, d1(c_i))$
where ApplyDerivation($c_1$, d1($c_2$)) replaces $c_1$ with $c_2$.

3. $\forall c \in C : \exists s_1 : s_1 = \text{ApplyDerivation}(c, d2) \wedge \exists s_2 : s_2 = \text{ApplyDerivation}(c, d2) \wedge (s_1 \Rightarrow s_2) \Rightarrow s_1 >_g s_2$

E.g. (c = <editor> $\wedge$ $s_1$ = <university> Press $\wedge$ $s_2$ = <town> university Press
$\wedge$ $s_3$ = Oxford university Press) $\Rightarrow$ ($s_1 >_g s_2 >_g s_3$)

Table 3. Rules to choose the most specific derivation
4 Test Platform

We  identified  the  need  for  support  for  pattern  identification  in  the  task  of  the 
editor  of  ETAI  [10],  an  electronic  journal  for  artificial  intelligence.  While  our 
method  is  adaptable  to  other  domains,  it  is  on  this  task  that  we  have  performed 
the  evaluation  of  our  methods  of  template  creation  and  management  (cf.  sec¬ 
tion  5).  Most  of  the  examples  which  have  illustrated  our  explanations  are  also 
examples  from  this  task.  The  policy  of  the  journal  includes  the  public  and  collab¬ 
orative  reviewing  of  the  articles  submitted  for  publication.  The  task  of  the  editor 
is  to  compile  the  email  messages  written  by  researchers  to  review  a  given  article, 
into  a  web  page.  Each  message  is  processed  in  its  order  of  arrival.  The  messages 
can  contain  references  to  articles.  One  of  the  operations  is  to  transform  the  pieces 
of  text  that  refer  to  articles  into  links  to  the  actual  articles.  Thus  in  this  task, 
the  patterns  to  recognize  are  the  references  to  articles,  and  the  texts  to  inspect 
are bodies of messages. Examples of patterns are “in this paper” and “in the
IJCAI paper of Harris”. The ontology for this task, as described in figure 1, is
composed of several <semantical_concept> which appear in references to articles
(i.e. <conference>, <author>, <editor>, <journal>, <university>), and
of one <grammatical_concept>: the <part-of-speech>. Each word of the patterns
is associated to a <part-of-speech>. Some words, alone or in sequence,
are also associated to a <semantical_concept>.

5  Practical  Analysis  of  Pattern  Generalization 

Characteristics  of  the  ontology  (e.g.  the  order  on  the  categories  of  concepts)  and 
of  the  patterns  (e.g.  the  similarity  between  the  patterns  to  identify  and  the  pat¬ 
terns  to  not  identify)  influence  the  granularity  of  the  generated  templates,  and 
thus  their  performances  in  terms  of  precision  and  recall.  Our  first  test  illustrates 
these  characteristics.  In  a  second  test,  we  evaluate  the  algorithm  in  terms  of 
precision  and  recall  and  show  that  the  generalization  of  templates  provides  a 
better  recall  than  a  method  which  incrementally  collects  the  patterns  and  does 
not  perform  any  generalization. 

First  test: 

We  have  implemented  the  algorithm  for  pattern  generalization  which  assumes  the 
existence of a predefined order “$>_g$” on the categories of concepts (cf. heuristic 1
for case 1 of the rule 2 given in table 2). In our test ontology, there are two categories
of concepts, <semantic_concept> and <grammatical_concept>, thus
there are two possible orders. The purpose of this test is to observe how the order
of the categories of concepts influences the form of the generated templates. From
the observation we propose heuristics to choose one order that will contribute
to a quicker generation of templates that provide good recall and precision.
Given two sets of patterns $P^- = \{p_1^-, \ldots, p_{10}^-\}$ and $P^+ = \{p_1, \ldots, p_{10}\}$, we performed
PG and got the set of templates $T = \{t_{p_1}, \ldots, t_{p_{10}}\}$. PG was performed
twice with two different biases on the ontology. Since the 10 templates produced
had the same characteristics, we illustrate the generalization for a single pattern:
p = “the results by Hill and Smith”.

Bias 1: <grammar_concept> $>_g$ <semantic_concept>.

Result: The elements of T are mainly sequences of part-of-speech tags.
$t_{bias1}$ = “<DT> <NNS> <IN> <NNP> <CC> <NNP>”.

Bias 2: <semantic_concept> $>_g$ <grammar_concept>.

Result: The elements of T are sequences of semantical concepts and part-of-speech
tags. The part-of-speech tags are mostly conceptualizations of articles and verbs.
$t_{bias2}$ = “<DT> <NNS> <IN> <author> <CC> <author>”.

The templates created with Bias 2 are more specific to the task and likely to be
more accurate.
This  test  highlights  some  new  parameters  which  influence  the  construction  of 
the  templates: 

—  The  order  between  the  categories  of  concepts. 

—  The  specificity  of  the  concepts  to  the  task.  For  example,  <grammar_concept> 
is  less  specific  to  the  task  than  <semantic_concept>. 

—  The  number  of  refinements  performed  on  the  templates  which  identify  pat¬ 
terns  that  should  not  be  identified. 

— The syntactic and semantic similarity between elements of $P^+$ and elements
of $P^-$: the more the elements of $P^+$ look like the elements of $P^-$, the higher
the granularity of the ontology should be.

Given  automatic  techniques  to  observe  these  parameters,  one  can  design  an  al¬ 
gorithm  which  adapts  PG  to  the  domain  of  the  task. 
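
The effect of the two biases on the example pattern can be sketched as a choice, for each word, of the concept from the highest-ranked category available. The tags and `<author>` are those of the example above; the dictionary `WORD_CONCEPTS`, the category labels and the function `generalize` are assumptions of this sketch.

```python
# Concepts attached to each word of the example pattern, by category.
# The dictionary layout is hypothetical; the tags come from the example.
WORD_CONCEPTS = {
    "the":     {"grammatical": "<DT>"},
    "results": {"grammatical": "<NNS>"},
    "by":      {"grammatical": "<IN>"},
    "Hill":    {"grammatical": "<NNP>", "semantic": "<author>"},
    "and":     {"grammatical": "<CC>"},
    "Smith":   {"grammatical": "<NNP>", "semantic": "<author>"},
}

def generalize(pattern, category_order):
    """Replace each word by its concept from the highest-ranked category
    available; words without any concept are kept literally."""
    out = []
    for word in pattern.split():
        concepts = WORD_CONCEPTS.get(word, {})
        chosen = next((concepts[c] for c in category_order if c in concepts),
                      word)
        out.append(chosen)
    return " ".join(out)

p = "the results by Hill and Smith"
t_bias1 = generalize(p, ["grammatical", "semantic"])
t_bias2 = generalize(p, ["semantic", "grammatical"])
```

Swapping the order of the two categories is the only difference between the two calls, which mirrors the way a single bias changes the form of every generated template.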

Second  test:  evaluation  of  PG  in  terms  of  recall  and  precision. 

Problem  description: 

—  The  texts  processed  are  two  e-mail  discussions.  The  first  discussion  contains 
14  mails  and  the  second  contains  8  mails. 

— The order $>_g$ on the concepts of the ontology is completed with the bias 1:
<grammar_concept> $>_g$ <semantic_concept>.

—  The  set  of  templates  created  during  the  processing  of  the  first  discussion 
continues  its  evolution  during  the  processing  of  the  second  discussion. 

Evaluation  data: 

—  N+ :  the  number  of  elements  of  P+  recognized  with  the  templates  while 
processing  the  ith  message.  In  brackets:  the  number  of  patterns  which  are 
copies  of  pattern  encountered  in  previous  messages. 

— $N^-$: the number of elements of $P^-$ recognized with the templates.

—  N( :  the  number  of  element  of  P+  that  have  not  been  recognized  with  the 
templates. 

— $N^{new}$: the number of new patterns encountered in the ith message.

Results: 

—  Table  4  gives  the  evaluation  data  and  the  recall  and  precision  rates  for  each 
discussion. 
— For the first discussion, $N^- = 0$, thus the templates are never too general.
Recall does not exceed 21.4% because there is a large syntactic difference
between the patterns.

—  For  the  second  discussion,  the  high  Recall  is  not  due  to  the  generalization 
process  but  to  the  high  usage  of  identical  patterns  in  the  discussion.  For 
example,  the  pattern  “the  paper”  occurs  6  times.  The  low  precision  is  due 
to  the  high  level  of  generality  (sequences  of  part  of  speech)  of  some  templates. 
For  example  the  template  “<DT>  <NN>”  generated  from  the  pattern  “the 
paper”  after  the  processing  of  the  4th  message  is  responsible  for  most  of  the 
8 wrong recognitions performed in the 5th message. The templates at fault
are immediately adapted: all of the wrong recognitions in this discussion
occur in the 5th mail of the discussion, and none occurs in later processing,
while correct recognitions continue to take place.

— By comparison with a system which incrementally builds a set of patterns
to recognize and performs the automatic recognition of exact copies in new
texts processed, PG increases the coverage of recognition by 66% in the
first discussion, and by 18% in the second. This difference in performance is
explained by the fact that the semantical concepts are well adapted to the
semantics carried by the patterns of the first discussion, but not to those of the second.
For  the  second  discussion,  the  addition  of  a  concept  <publication>  corre¬ 
sponding  to  words  like  “paper”,  “journal”  or  “article”  would  have  increased 
the  performance  of  the  generalization. 
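
For reference, recall and precision can be computed from the evaluation counts as sketched below. This assumes the standard definitions, which the text does not spell out: `recognized` stands for the correctly recognized elements of $P^+$, `wrong` for the recognized elements of $P^-$, and `missed` for the elements of $P^+$ that were not recognized. The counts in the example are hypothetical and are not taken from Table 4.

```python
def recall(recognized, missed):
    """Fraction of the patterns to identify that the templates recognized."""
    total = recognized + missed
    return recognized / total if total else 0.0

def precision(recognized, wrong):
    """Fraction of the recognitions that were correct."""
    total = recognized + wrong
    return recognized / total if total else 0.0

# Hypothetical counts for one discussion (illustration only).
r = recall(recognized=6, missed=2)
pr = precision(recognized=6, wrong=2)
```

A template that is too general raises `wrong` and lowers precision, while a set of templates that is too specific raises `missed` and lowers recall.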
Table 4. Tests: Evaluation data, recall and precision for the discussions


To conclude, the experiment shows that the generalization is immediately
useful if some parameters are properly set. The most important settings are (1)
having a fairly complete ontology and (2) having a management of the categories
of concepts adapted to the domain of the task. We believe that these needs can
be met to some reasonable extent.
6 Related Work

Information Extraction [2] systems ([6], [7], [3], etc.) use techniques to collect
sets of templates which identify interesting pieces of text in large corpora.
The creation of the templates is done via training on a part of the corpus. Some
attempts have been made to automate the creation of the templates (AutoSlog-TS
[11]). These attempts simplify the training, but do not eliminate it. Moreover,
it is not possible to learn new templates on the fly. In our case, a requirement such as
having a large number of texts to train on beforehand is not acceptable. While
IE techniques are very inspiring (templates, natural language analysis), they are
not directly applicable to the support for repetitive tasks. In [4], R. Grishman
and J. Sterling describe the generalization of acquired semantic patterns. Their
approach is based on the computation of the frequency of each semantic pattern
in a corpus. Once again it is an approach that we cannot adopt since we
do not have any corpus beforehand. When it comes to providing support to
repetitive  tasks,  the  APE  project  [12]  describes  a  technique  to  learn  incremen¬ 
tally  the  set  of  recurrent  sequences  of  commands  typed  or  clicked  by  users  in  a 
programming  environment  for  Smalltalk.  This  project  shows  the  usefulness  of  in¬ 
cremental  learning  in  a  system  which  records  the  actions  of  the  user  and  analyses 
them  to  identify  patterns.  However,  their  domain  of  command  sequences  is  less 
complex  to  reason  about  than  the  domain  of  natural  language.  Other  projects 
explore  the  domain  of  personal  assistance  on  repetitive  tasks.  CAL  [8]  refines 
the  principles  initiated  for  [9]:  “learning  through  experience”  a  set  of  rules  to 
manage  a  calendar.  The  concepts  are  predefined,  fixed,  and  limited  in  number 
(we  counted  22).  We  have  no  limitation  on  the  number  of  concepts  and  manage 
several  granularities  and  categories. 

7  Conclusion  and  Future  Work 

We  propose  a  method  for  template  generation  to  support  the  identification  of 
important  pieces  of  texts  with  respect  to  a  given  task.  This  method  learns  in¬ 
crementally  a  set  of  patterns.  The  set  is  completed  and  refined  after  each  new 
execution  of  the  task,  thus  providing  a  better  support  for  the  next  execution. 
We  described  a  solution  for  template  generation:  pattern  generalization.  Given  a 
pattern  which  triggers  the  operation  o,  pattern  generalization  generates  a  most 
general  template  that  recognizes  only  patterns  that  trigger  o.  We  have  tested 
the  pattern  generalization  on  a  set  of  patterns  extracted  from  a  real  life  task. 
The  results  show  that  the  performance  depends  on  the  completeness  and  gran¬ 
ularity  of  the  ontology,  and  on  the  relations  between  the  categories  of  concepts. 
As  further  work,  we  plan  to  investigate  the  management  of  the  templates.  The 
current  system  already  handles  the  refinement  of  templates  after  each  execution 
of  the  task.  It  should  also  be  able  to  recognize  relations  between  the  templates 
to  identify  inconsistencies  or  templates  that  cover  the  same  sets  of  patterns. 
We  also  plan  to  investigate  methods  to  incrementally  adapt  the  ontology  and 
to  produce  a  cooperation  between  the  semantic  and  grammatical  categories  of 
concepts. We intend to establish a system to evaluate the performance of templates.
This system would allow the addition of new rules to choose the most
suitable template among the most general templates.
Abstract. Implication rules have been used in uncertainty reasoning
systems  to  confirm  and  draw  hypotheses  or  conclusions.  However  a  ma¬ 
jor  bottleneck  in  developing  such  systems  lies  in  the  elicitation  of  these 
rules.  This  paper  empirically  examines  the  performance  of  evidential  in- 
ferencing  with  implication  networks  generated  using  a  rule  induction  tool 
called  KAT.  KAT  utilizes  an  algorithm  for  the  statistical  analysis  of  em¬ 
pirical  case  data,  and  hence  reduces  the  knowledge  engineering  efforts 
and  biases  in  subjective  implication  certainty  assignment.  The  paper  de¬ 
scribes  several  experiments  in  which  real-world  diagnostic  problems  were 
investigated;  namely,  medical  diagnostics.  In  particular,  it  attempts  to 
show  that  (1)  with  a  limited  number  of  case  samples,  KAT  is  capable 
of  inducing  implication  networks  useful  for  making  evidential  inferences 
based  on  partial  observations,  and  (2)  observation  driven  by  a  network 
entropy  optimization  mechanism  is  effective  in  reducing  the  uncertainty 
of  predicted  events. 


1  Introduction 

One  of  the  important  aspects  of  using  expert  systems  technology  to  solve  real- 
world  problems  lies  in  the  management  of  domain-knowledge  uncertainty.  Several 
methods  of  reasoning  under  uncertainty  have  been  proposed  in  the  past  [1]  [11] 
[13]  [15].  All  these  approaches  require  a  representation  of  domain  knowledge. 
Generally  speaking,  constructing  a  valid  knowledge  representation  is  a  time- 
consuming  task  and  often  subject  to  opinion  biases  or  semantics  invalidity  if  it  is 
built  purely  based  on  human  heuristics.  To  overcome  the  difficulties  in  knowledge 
acquisition,  several  investigations  have  been  carried  out  in  recent  years  to  explore 
the  effectiveness  and  validity  of  automated  means  such  as  algorithms  to  perform 
this  task. 

Pitas  et  al  [14]  have  proposed  a  method  of  learning  general  rules  from  specific 
instances  based  on  a  minimal  entropy  criterion.  Geiger  [7]  has  formulated  a 
learning  algorithm  for  uncovering  a  Bayesian  conditional  dependence  tree.  This 
algorithm  combines  entropy  optimization  with  Heckerman’s  similarity  networks 
modeling  scheme  [8].  Cooper  and  Herskovits  [2]  have  developed  an  algorithmic 
method  of  empirically  inducing  probabilistic  networks,  which  utilizes  a  Bayesian 
framework  to  assess  the  probability  of  a  network  topology  given  a  distribution 
of  cases.  A  heuristic  technique  is  provided  to  optimize  the  search  for  probable 
topologies.  Simulation  results  have  shown  that  a  small  37-node,  46-link  network 
can  be  derived  with  3,000  cases. 

Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 474-485, 2000.
In this paper, we present a new rule-learning algorithm for inducing implication
relations based on a small number of empirical data samples. The major
difference between Cooper and Herskovits’ approach and ours is that their approach
focuses on topological induction accuracy while ours is concerned with
the accuracy of inferences based on an induced network, without regard to
the topological uniqueness. Our approach to implication network induction has
been implemented in a toolbox called KAT, which contains several components,
namely empirical data acquisition, an implication rule elicitation module, a network
validation module, an optimal observation determination module, and an embedded
diagnostic inferencing engine which implements uncertainty reasoning schemes.

Our  approach  to  implication  induction  draws  on  the  previous  work  on  empir¬ 
ical  construction  of  inference  networks  [4].  The  present  study  further  extends  the 
earlier  work  by  augmenting  the  implications  with  certainty  measures.  Another 
related  work  is  the  development  of  a  prediction  logic  based  on  a  contingency-table 
of  probabilities,  as  proposed  by  Hildebrand  et  al.  [9].  In  their  work,  the  emphasis 
was  on  the  definition  and  computation  of  precision  and  accuracy  of  propositions 
represented.  An  analogy  was  made  between  contingency  table  based  prediction 
logic  and  formal  proposition  logic.  To  validate  the  implication  networks  gener¬ 
ated  from  KAT,  we  have  conducted  a  series  of  empirical  experiments  to  examine 
the  performance  of  evidential  inferencing  with  the  induced  networks.  The  chosen 
problem domain is medical diagnosis; this task shares many commonalities with
other  real-world  problems  as  described  in  [1]  [6]  and  has  been  in  part  inspired  by 
earlier  studies  on  knowledge  space  theory  (KST)  by  Doignon  and  Falmagne  [5]. 
The  KST  presents  an  interesting  set-theory  interpretation  of  knowledge  states  as 
well  as  its  mathematical  foundations.  In  our  present  framework,  unlike  the  one 
by  Doignon  and  Falmagne,  the  interdependencies  among  knowledge  units  are  the 
closures  under  union  and  intersection,  which  can  be  correctly  represented  with  a 
directed  inference  network.  Hence,  our  implication  networks  representation  (i.e., 
an  instance  of  implication  networks)  can  be  viewed  as  a  proper  subset  of  the 
knowledge  space  representation. 

In  this  paper,  we  examine  the  effectiveness  and  exactness  of  inferences  with 
statistically  induced  networks.  Our  claim  is  that  the  proposed  network  induc¬ 
tion  method  is  capable  of  generating  logically  and  empirically  sound  implication- 
based  domain  representations  useful  in  predicting  unobserved  events  upon  re¬ 
ceiving  certain  partial  information.  While  validating  the  networks  in  several 
real-world  task  domains,  we  attempt  to  demonstrate  the  generality  of  the  al¬ 
gorithmic  rule  induction  and  reasoning  approach  in  solving  problems  where  a 
complete  set  of  events  is  too  difficult  to  observe  or  the  diagnostic  judgments  are 
subject  to  human  errors. 

2  Implication  Network  Induction 

In the present work, we use the term implication network to refer to a directed acyclic
graph in which the nodes represent individual event variables or hypotheses, and
the arcs signify the existence of direct implication (e.g., influence) among the
nodes. The value taken on by one event variable is dependent on the values taken
on  by  all  variables  that  influence  it.  Each  value  indicates  the  likelihood  of  an 
unobserved  event.  The  value  is  updated  every  time  new  information  is  obtained 
(e.g.,  some  symptom  is  observed).  The  strengths  of  the  event  interdependencies 
are  quantified  by  functions  (e.g.,  belief  functions),  as  weights  associated  with  the 
arcs. 

Formally, an implication network can be represented as an ordered quadruple:

$Net = (\mathcal{N}, \mathcal{I}, \alpha_c, P_{min})$, (1)

where $\mathcal{N}$ is a finite set of nodes and $\mathcal{I}$ is a finite set of arcs. $\alpha_c$ is the network
induction error and $P_{min}$ is the minimal conditional probability to be estimated
in the arcs. Furthermore, each induced implication rule can be specified by the
following quadruple:

Imp  =  ( Nant,Nconci ,  Wi,Wi),  (2) 

where  Wj  and  Wi  are  weight  functions  that  map  the  pairs  of  antecedent-consequent 
nodes,  i.e.,  Nant  and  Nconci,  and  their  negations  to  a  real  number  between  0  and 
1,  respectively.  That  is, 


Wi  :  Nant  x  Nconci  ->•  [0,1].  (3) 

Wi  :  -i Nconci  x  ->iVani  ->  [0,1].  (4) 


B  ->B 

A 
-•A 


Naab  Naa-b 

N-.AAB  N^AA-.B 


Fig.  1.  contingency  table  where  cells  indicate  the  number  of  co-occurrences. 


2.1  The  Rule-Elicitation  Algorithm 

The basic idea behind the empirical construction is that in an ideal case, if
there is an implication relation $A \Rightarrow B$, then we would never expect to
find, in the empirical data samples, co-occurrences of the kind shown in Figure 1
in which event $A$ is true but event $B$ is not. This translates into the
following two conditions:

$$P(B \mid A) = 1 \qquad (5)$$

$$P(\neg A \mid \neg B) = 1 \qquad (6)$$

In reality, however, due to noise such as sampling errors, we have to relax
Conditions 5 and 6. KE takes into account the imprecise/inexact nature of
implications and verifies the above conditions by computing the lower bound of a
$(1 - \alpha_{error})$ confidence interval around the measured conditional
probabilities. If the verification succeeds, an implication relation between the
two events is asserted. Two weights are associated with the relation¹, which
correspond to the relation's conditional probabilities $P(B \mid A)$ and
$P(\neg A \mid \neg B)$. Together, these weights express the degree of certainty
in the implication. Once an implication relation can be determined, another
logical operator, $\Leftrightarrow$, is readily defined as follows:

$$(A \Rightarrow B) \Rightarrow ((B \Rightarrow A) \Rightarrow (B \Leftrightarrow A)) \qquad (7)$$

The  elicitation  of  dependences  among  the  nodes  requires  considering  the 
existence  (or  nonexistence)  of  direct  relationships  between  pairs  of  random  vari¬ 
ables  in  a  domain  model.  In  theory,  there  exist  six  possible  types  of  implications 
between  any  two  nodes  or  events. 

The  implication  rule  elicitation  algorithm  can  be  stated  as  follows: 

The Rule-Elicitation Algorithm

Begin
  set an arbitrary level $\alpha_c$ and a minimal conditional probability $p_{min}$
    (this test can be repeated for different $\alpha_c$ and $p_{min}$; an example
    is $\alpha_c = 0.05$ and $p_{min} = 0.5$).
  for $node_i$, $i \in [0, n_{max} - 1]$, and $node_j$, $j \in [i + 1, n_{max}]$
    for all empirical case samples
      compute a contingency table
        $T_{ij} = \begin{pmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{pmatrix}$
    endfor
    for each rule type $k$ out of the six possible cases,
      test the following inequality:

        $$P(x \le N_{error\_cell}) \le \alpha_c \qquad (8)$$

      based on the two lower tails of binomial distributions $Bin(N, p_{min})$ and
      $Bin(N', p_{min})$, where $N$ and $N'$ denote the occurrences of antecedent
      satisfactions in the two inferences using a type $k$ implication rule, i.e.,
      in modus ponens and modus tollens, respectively. $\alpha_c$ is the alpha
      error (or significance level) of the conditional probability test.
      if the test succeeds
        return a type $k$ implication rule,
      endif
    endfor
  endfor
End
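The pairwise loop above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it covers only the positive implication type (tried in both orders for each pair of nodes), it assumes that every sample is a row of 0/1 values, and the function names (`lower_tail`, `contingency`, `positive_rule`, `elicit_positive_rules`) are illustrative choices, not names from the paper.

```python
from itertools import combinations
from math import comb


def lower_tail(n, errors, p_min):
    """P(x <= errors), where x counts violations among n trials and the
    true conditional probability of a predicted result is exactly p_min."""
    return sum(comb(n, i) * p_min ** (n - i) * (1 - p_min) ** i
               for i in range(errors + 1))


def contingency(samples, a, b):
    """Count the four co-occurrence cells for binary columns a and b."""
    n_ab = sum(1 for s in samples if s[a] and s[b])
    n_a_notb = sum(1 for s in samples if s[a] and not s[b])
    n_nota_b = sum(1 for s in samples if not s[a] and s[b])
    n_nota_notb = sum(1 for s in samples if not s[a] and not s[b])
    return n_ab, n_a_notb, n_nota_b, n_nota_notb


def positive_rule(table, alpha_c, p_min):
    """Test a => b in both inference directions (Inequality 8)."""
    n_ab, n_a_notb, _, n_nota_notb = table
    n_a = n_ab + n_a_notb              # antecedent count for modus ponens
    n_notb = n_a_notb + n_nota_notb    # antecedent count for modus tollens
    ponens_ok = lower_tail(n_a, n_a_notb, p_min) <= alpha_c
    tollens_ok = lower_tail(n_notb, n_a_notb, p_min) <= alpha_c
    return ponens_ok and tollens_ok


def elicit_positive_rules(samples, alpha_c=0.05, p_min=0.5):
    """Return all ordered pairs (a, b) for which a => b passes both tests."""
    rules = []
    for i, j in combinations(range(len(samples[0])), 2):
        for a, b in ((i, j), (j, i)):
            if positive_rule(contingency(samples, a, b), alpha_c, p_min):
                rules.append((a, b))
    return rules
```

Both directions share the same error cell (antecedent true, consequent false); only the number of trials differs, which is why a rule can pass the modus ponens test and still fail the modus tollens test when few negative cases are available.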


Here it is assumed that the conditional probability is $p$ in each sample, and
all $n$ samples are independent. If $X$ is the frequency of the occurrence, then
$X$ satisfies a binomial distribution, i.e., $X \sim Bin(n, p)$, whose probability
function $P_X(k)$ and distribution function $F_X(k)$ are given below:

$$P_X(k) = \binom{n}{k} p^k q^{n-k} \qquad (9)$$

$$F_X(k) = P(X \le k) = \sum_{j=0}^{k} \binom{n}{j} p^j q^{n-j}, \quad p = 1 - q \qquad (10)$$

¹ With respect to the two directions of the inference, i.e., modus ponens vs.
modus tollens.


$A_1 \wedge B_1 \Rightarrow C_2 \vee C_3$
$A_1 \wedge B_2 \Rightarrow C_1 \vee C_3$
$A_2 \wedge B_1 \Rightarrow C_1 \vee C_2 \vee C_3$
$A_2 \wedge B_2 \Rightarrow C_2$


Fig. 2. A contingency table where cells indicate the number of co-occurrences in
the case of multivariate implications.

Thus, the test of hypothesis for $A \Rightarrow B$ can be carried out by computing
a lower-tail confidence interval over a binomial function:

$$P(x \le N_{A \wedge \neg B}) = \sum_{i=0}^{N_{A \wedge \neg B}} \binom{n}{i} p^{n-i} (1 - p)^{i} \qquad (11)$$

where $n$ has the same definition as above, and where $p$ is set to the desired
minimal conditional probability. This formula represents the probability that
as small a number as $x$ of unpredicted results would be observed if the true
probability of a predicted result were exactly $p$. The smaller the probability
given by the formula is, the less likely it is that the true probability of a
predicted result is less than $p$.

From a theoretical point of view, we could increase the dimensionality of the
distribution to incorporate all variables relevant to the problem in question
and allow the variables to be multivariate as illustrated in Figure 2. In such a
case, the probability function to be considered becomes:

$$p_{X_1, \ldots, X_r}(k_1, \ldots, k_r) = \frac{n!}{k_1! \cdots k_r!} \, p_1^{k_1} \cdots p_r^{k_r}.$$

From a practical point of view, this would also introduce exponential
computational complexity. In the present study, we concentrate on bivariate
variables pairwise, which reduces the scope of the problem for which
probabilities have to be elicited. Often this is known as naive Bayes.

2.2  An  Example  of  Positive  Implication  Induction 

The  following  section  illustrates  how  the  above  algorithm  is  used  to  verify  the 
existence  of  a  positive  implication  rule:  A  =>  B. 


                    $C_1$   $C_2$   $C_3$
$A_1 \wedge B_1$      ■       □       □
$A_1 \wedge B_2$      □       ■       □
$A_2 \wedge B_1$      □       □       □
$A_2 \wedge B_2$      ■       □       ■

(Cell grid of Fig. 2.)

In the first step of positive implication rule induction, a two-dimensional
contingency  table  for  variables  A  and  B  is  compiled.  As  computed  from  an 
empirical  data  set,  the  cells  in  the  contingency  table  contain  the  observed  joint 
occurrences  for  the  respective  four  possible  combinations  of  values.  Table  1  shows 
an  example  of  the  contingency  table  with  respective  co-occurrences  of  variables 
A  and  B  in  a  hypothetical  data  set. 


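              $B$                           $\neg B$
$A$           20 ($N_{A \wedge B}$)         1 ($N_{A \wedge \neg B}$)
$\neg A$       8 ($N_{\neg A \wedge B}$)    1 ($N_{\neg A \wedge \neg B}$)

Table 1. Distribution of observed occurrences


where $N_{\cdot}$ denotes the number of occurrences of the respective situation.
The total numbers of $A$ and $\neg B$ can be derived accordingly as follows:

$$N_A = N_{A \wedge B} + N_{A \wedge \neg B} = 21$$

$$N_{\neg B} = N_{A \wedge \neg B} + N_{\neg A \wedge \neg B} = 2$$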

Statistical Tests for Implication Existence: The second step of our induction
method consists of an assessment of the numerical constraints imposed by
$A \Rightarrow B$. More specifically, the assessment is based on the lower tails
of binomial distributions $Bin(N_A, p_{min})$ and $Bin(N_{\neg B}, p_{min})$ to
test the measured conditional probabilities $P(B \mid A)$ and
$P(\neg A \mid \neg B)$, where $N_A = N_{A \wedge B} + N_{A \wedge \neg B}$,
$N_{\neg B} = N_{A \wedge \neg B} + N_{\neg A \wedge \neg B}$, and $p_{min}$ is an
arbitrary number chosen as the minimal conditional probability for an
implication relation. For each of the two binomial distributions, we check to
see whether Inequality 8 can be satisfied.

Suppose that in this example, $p_{min} = 0.85$ and $\alpha_c = 0.20$.
Accordingly, the binomial distribution for testing $P(B \mid A)$ can be written
as $Bin(21, 0.85)$. The computation of the lower bound proceeds as follows:

$$P(x \le N_{A \wedge \neg B}) = P(x \le 1)$$

$$= P(x = 0) + P(x = 1)$$

$$= \binom{21}{0} 0.85^{21} \, 0.15^{0} + \binom{21}{1} 0.85^{20} \, 0.15^{1}$$

$$= 0.155$$

hence $P(x \le N_{A \wedge \neg B}) \le \alpha_c$, where the symbol $\binom{j}{k}$
represents the number of combinations of $k$ in $j$. The inference with
$A \Rightarrow B$ in the modus ponens direction is significant with confidence
level $(1 - \alpha_c)$. In a similar way, given $Bin(2, 0.85)$, the test for
$P(\neg A \mid \neg B)$ yields:

$$P(x \le N_{A \wedge \neg B}) = \binom{2}{0} 0.85^{2} \, 0.15^{0} + \binom{2}{1} 0.85^{1} \, 0.15^{1} = 0.98$$

hence,

$$P(x \le N_{A \wedge \neg B}) > \alpha_c$$

Since Inequality 8 for the test of $P(\neg A \mid \neg B)$ is not satisfied,
$A \Rightarrow B$ cannot be used for modus tollens inference. Hence, the positive
implication rule $A \Rightarrow B$ is rejected. The overall, worst-case time
complexity of inducing an implication network with the above algorithm is
$O(n_{max}^2)$, where $n_{max}$ is the number of nodes for modeling the domain.
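The two tail probabilities of this example can be re-computed with a few lines of Python. This is only a numerical check of the figures quoted above; the helper name `lower_tail` is an illustrative choice.

```python
from math import comb


def lower_tail(n, errors, p):
    """P(x <= errors) for x violations among n trials, success probability p."""
    return sum(comb(n, i) * p ** (n - i) * (1 - p) ** i
               for i in range(errors + 1))


p_min, alpha_c = 0.85, 0.20
ponens = lower_tail(21, 1, p_min)   # N_A = 21 trials, one violating case
tollens = lower_tail(2, 1, p_min)   # N_notB = 2 trials, one violating case

print(f"modus ponens tail:  {ponens:.3f}  passes: {ponens <= alpha_c}")
print(f"modus tollens tail: {tollens:.3f}  passes: {tollens <= alpha_c}")
```

With only two negative cases of $B$ in the data, a single violation is enough to make the modus tollens test fail, which is what leads to the rejection of the rule.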


3  Empirical  Cases 

This section describes the empirical data used in a series of experiments aimed
at investigating the effectiveness and exactness of induced implication networks
in diagnostic reasoning. The selected task domain is medical diagnosis.

In the current study, we model the different possible knowledge states by
a partial order. Although this formalism cannot fully represent all possible
knowledge states, it captures a large part of the constraints on the ordering
among knowledge units (KU) and can be used for the purpose of automatic
knowledge assessment [3]. The data used to induce implication networks for
medical diagnosis consist of a set of attributes which are continuous variables.
In order to build a network, these attributes were first transformed into
bivariate (i.e., binary) values using thresholds.

The medical diagnostic method developed in this work was first validated
using the empirical cancer data samples collected from 69 healthy people and 31
cancer patients. Each sample contains the information on 22 chemical residues
(i.e., attributes) found in a biopsy. In order to build the network, we first
transformed the ordered continuous variables, i.e., trace element concentrations,
into two-valued Boolean variables, by means of thresholding.
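The thresholding step can be sketched as below. The paper does not state the threshold values or whether the comparison is strict, so the thresholds and the `>` comparison here are assumptions made purely for illustration; the three rows are the Mg, Ca and Cu concentrations of the sample records listed in the paper.

```python
def binarize(samples, thresholds):
    """Map each continuous attribute to 1 if it exceeds its threshold, else 0."""
    return [[1 if value > limit else 0 for value, limit in zip(row, thresholds)]
            for row in samples]


# Mg, Ca and Cu concentrations of three samples.
rows = [[223.62, 1806.75, 8.71],
        [46.33, 405.20, 13.92],
        [47.73, 367.92, 17.10]]

# Hypothetical thresholds, one per attribute (not taken from the paper).
thresholds = [100.0, 1000.0, 10.0]

binary_rows = binarize(rows, thresholds)
print(binary_rows)
```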


$Zn \Rightarrow Mg$    0.7826    0.7959
$Zn \Rightarrow Ca$    0.8695    0.8775
$Zn \Rightarrow Cu$    0.6956    0.7454
$Co \Rightarrow Ni$    0.7297    0.8076
$Cd \Rightarrow Zn$    0.7096    0.8333
$Cd \Rightarrow Ni$    0.8064    0.8846
$Cd \Rightarrow Co$    0.7096    0.8571
$Cd \Rightarrow Cu$    0.8064    0.8909
$Mg \Rightarrow Ca$    0.8823    0.8775
$Mg \Rightarrow Cu$    0.7058    0.7272
$V \Rightarrow Ni$     0.7058    0.8076
$Cu \Rightarrow Ca$    0.7555    0.7755

Table 2. Examples of the induced positive implication rules (subset).


The derived data set was used to induce the network. Tables 3 and 4 show a
few examples of the original and the derived data set samples, respectively.
Table 2 presents a subset of the induced implication network in the form of
pairwise gradation relations.

Zn      Pb     Ni     Co     Cd     Mn     Cr     Mg      V      Al      Ca       Cu     Ti     Se     Categ.
237.84  8.50   1.532  1.045  0.590  1.953  1.717  223.62  1.696  0.010   1806.75  8.71   0.732  0.001  1
203.15  12.70  2.362  1.707  0.898  1.347  1.204  46.33   0.811  4.189   405.20   13.92  0.689  0.001  1
266.34  4.44   0.085  1.013  0.382  2.151  0.340  47.73   0.010  13.137  367.92   17.10  2.896  0.003  2

Table 3. The original trace concentration data samples.

Zn     Pb     Ni      Co      Cd      Mn     Cr     Mg     V      Al     Ca     Cu     Ti     Se     Category
01.00  01.00  01.000  01.000  01.000  1.000  1.000  01.00  1.000  0.000  01.00  00.00  1.000  0.000  1
01.00  01.00  01.000  01.000  01.000  1.000  1.000  00.00  1.000  0.000  00.00  01.00  0.000  0.000  1
01.00  01.00  00.000  01.000  01.000  1.000  1.000  00.00  0.000  1.000  00.00  01.00  1.000  0.000  2

Table 4. The transformed trace concentration data samples (subset).


4  Evidential Inferences

To validate the accuracy of the evidential inferences generated from implication
networks, we have conducted a series of experiments in simulated diagnostic task
settings. In particular, we used constructed implication networks as the basis
for evidential inferences. Each simulation run consisted of selecting a portion
of a subject's sample data and propagating evidential supports throughout the
network.

4.1  Experimental  Method 

There exist various interpretations of the imprecision measure associated with
an implication rule [11]. Each interpretation dictates the way in which inferences
are to be performed. Bayesian inference is based on the mapping of an implication
relation into conditional probabilities [13]. Taking an implication
$A \Rightarrow B$ for example, updating the probability would be based upon
$P(B \mid A)$, which should approach 1.0 if the implication is strong. The
difficulty with this scheme stems from the fact that if a further observation of
$C$ is obtained and if there is a relation $C \Rightarrow B$, then there is a
need to update the value of $B$ based upon $P(B \mid A, C)$, and so on. As more
observations occur, the conditional probabilities become practically impossible
to estimate, whether subjectively or from sample data. To address this
difficulty in a Bayesian belief network, the assumption of independence is made
between individual implication relations. In the present work, we have applied
the Dempster-Shafer (D-S) method of evidential reasoning to propagate supports
(whether confirming or disconfirming) throughout the implication network. The
D-S inferencing scheme may be regarded as a complex theoretical deviation from
the Bayesian theory. According to the D-S scheme, the set of possible outcomes
of a node is called the frame of discernment, denoted by $\Theta$. If the
antecedents of a rule confirm a conclusion with degree $m$, the rule's effect on
belief in the subsets of $\Theta$ can be represented by so-called probability
masses. In our bivariate case of knowledge assessment, there are only two
possible outcomes for each node $q_i$, that is,
$\Theta = \{known, \neg known\}$.

The D-S scheme provides a means for combining beliefs from distinct sources,
known as Dempster's rule of combination. This rule states that two assignments,
corresponding to two independent sources of evidence, may be combined to yield
a new one, that is,

$$m(X) = k \sum_{X_i \cap X_j = X} m_1(X_i) \, m_2(X_j) \qquad (12)$$

where $k$ is a normalization factor. Another evidential inference methodology,
called Certainty Factors (CF), as previously implemented in MYCIN [1], was
also applied in this study. This approach may be viewed as a special case of
the D-S evidential reasoning. The two approaches differ from each other only in
combining two opposite beliefs (i.e., one confirming and the other disconfirming).
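Dempster's rule of combination (Equation 12) can be illustrated on the two-outcome frame used here. The sketch below is an assumption-laden illustration, not the code used in the experiments: mass assignments are represented as dictionaries keyed by `frozenset`, and the normalization factor is taken to be the standard $k = 1 / (1 - conflict)$, where the conflict is the total mass falling on empty intersections.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass assignments with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (x1, v1), (x2, v2) in product(m1.items(), m2.items()):
        intersection = x1 & x2
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + v1 * v2
        else:
            conflict += v1 * v2          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    k = 1.0 / (1.0 - conflict)           # normalization factor
    return {subset: k * value for subset, value in combined.items()}


KNOWN = frozenset({"known"})
NOT_KNOWN = frozenset({"not_known"})
THETA = KNOWN | NOT_KNOWN                # frame of discernment

# Two confirming sources, then one confirming and one disconfirming source.
agree = combine({KNOWN: 0.8, THETA: 0.2}, {KNOWN: 0.6, THETA: 0.4})
oppose = combine({KNOWN: 0.8, THETA: 0.2}, {NOT_KNOWN: 0.5, THETA: 0.5})
print(agree[KNOWN], oppose[KNOWN], oppose[NOT_KNOWN])
```

In the agreeing case the mass on `known` rises to 0.92; in the opposing case part of the product mass falls on the empty set and is redistributed through $k$.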


4.2  Results  in  Medical  Diagnosis 

This  section  presents  the  empirical  results  of  evidential  inferences  using  the 
databases  of  cancer  diagnosis  instances  as  mentioned  in  Section  3.  In  each  of 
the  two  experiments,  the  numeric-valued  attributes  were  first  discretized  into 
binary  values  which  were  then  used  for  both  network  induction  and  inferencing 
validation. 

In the case of cancer diagnosis, 40 patient samples were compiled to induce
the implication network with $p_{min} > 0.5$ and $\alpha_c < 0.30$. The generated
network contains 87 implication relations. Another set of 60 patient samples was
used to validate the evidential inferencing.

During the validation, a certain percentage of attributes in each test case
was randomly sampled, and the rest of the attributes were inferred from the
implications. Upon the completion of inferencing, a pair of thresholds $(u, v)$
(i.e., bi-directional thresholds) was defined to filter the numeric-valued
weights. That is, if a specific node has a weight $w > v$, then the node is
believed to be TRUE. On the other hand, if $w < u$, the node is believed to be
FALSE (i.e., the corresponding attribute does not exist). The resulting filtered
predictions were compared with the actual values in the test samples.
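The bi-directional filtering and the comparison with the test samples can be sketched as follows. The text does not say how weights between $u$ and $v$ are treated; this sketch assumes they are left undecided and counted separately. The function names and the example weights are illustrative.

```python
def filter_prediction(weight, u, v):
    """Return True, False, or None (undecided) for a node weight."""
    if weight > v:
        return True
    if weight < u:
        return False
    return None


def score(weights, actual, u, v):
    """Count correct, wrong, and undecided predictions against actual values."""
    correct = wrong = undecided = 0
    for weight, truth in zip(weights, actual):
        predicted = filter_prediction(weight, u, v)
        if predicted is None:
            undecided += 1
        elif predicted == truth:
            correct += 1
        else:
            wrong += 1
    return correct, wrong, undecided


result = score([0.9, 0.2, 0.5, 0.8], [True, False, True, False], u=0.3, v=0.7)
print(result)
```

Widening the gap between $u$ and $v$ makes the filter more conservative: fewer nodes are predicted, and typically fewer of those predictions are wrong.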


4.3  Experiment E-5: Cancer Diagnosis


Globally speaking, given the distributions of evidentially predicted weights and
initial weights with respect to various bi-directional thresholds, it can be
observed that in the guessing case, both the correctly predicted nodes and the
errors were almost linear functions of the observation rate. However, in the
evidential inferencing case, the shapes of these two rate profiles changed,
indicating that as the observation increased, additional nodes were added to
both the correct predictions and the errors. It should also be noted that the
error rates in the inferencing case quickly stabilized after the amount of
observation exceeded a certain percentage.

To further compare the results of inference-based prediction and initial
weight-based guessing, a pair of bi-directional thresholds was selected from each
of the two figures such that the two selected cases would have similar error
rates. At 0% sampling, the inferencing case predicted about 45% due to its
conservative thresholding. However, as the observation increased, correct
predictions were quickly added along with some wrong predictions. The evidential
inferencing resulted in a consistently better performance in evaluating the
unobserved nodes when the observation sampling exceeded 18%, as compared to the
pure initial weight-based guessing. For instance, at 45% observation, the
inferencing method correctly predicted 4% more attributes than the guessing
method.

5  Entropy-Driven  (ED)  Diagnosis  Based  on  Induced 
Networks 

In diagnostic reasoning, various rules may be applied to determine which node is
to be observed next. One approach is to randomly choose symptom nodes from a
complete symptom set that spans all the symptoms in the diagnostic structure,
as studied in the previous section. Another approach is to apply entropy
optimization and choose the most informative node. This section investigates the
performance of entropy-driven (ED) evidential inferences based on the induced
implication networks.

In the following experiments, the expected information yield of each individual
node over all the possible outcomes is computed and weighted by the likelihood
of each outcome. The node that has the maximum expected information yield is
chosen as the potentially most informative one, which is to be observed next.
Formally, the expected information yield of an observation is defined as
follows:

$$I(node_i) = E_{cur}(net) - E_{exp}(net)$$

$$= E_{cur}(net) - [\, p_i E(net \mid node_i = TRUE) + (1 - p_i) E(net \mid node_i = FALSE) \,]$$

$$= E_{cur}(net) + p_i \sum_{k=1}^{n_{max}} [\, p'_k \log p'_k + (1 - p'_k) \log (1 - p'_k) \,] + (1 - p_i) \sum_{k=1}^{n_{max}} [\, p''_k \log p''_k + (1 - p''_k) \log (1 - p''_k) \,]$$

where $E_{cur}(net)$ denotes the current network entropy, $E(net \mid \cdot)$
denotes the updated network entropy having observed $node_i$, and $p_i$ is the
current probability of $node_i = TRUE$. $p'_k$ and $p''_k$ are the updated
probabilities of a network node $node_k$, having observed that $node_i = TRUE$
and $node_i = FALSE$, respectively.
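The selection of the next node to observe can be sketched as below. Several points are assumptions made for illustration: the network entropy is taken to be the sum of binary entropies of the node probabilities (base-2 logarithm), and `propagate` is a hypothetical stand-in for whatever evidential propagation is in use. The toy propagation function only mimics the 0.9/0.1 assignment for observed nodes and a single link from node 0 to node 1.

```python
from math import log2


def entropy(probs):
    """Sum of binary entropies of the node probabilities."""
    def h(p):
        return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
    return sum(h(p) for p in probs)


def expected_yield(probs, i, propagate):
    """Current entropy minus the expected entropy after observing node i."""
    p_i = probs[i]
    e_true = entropy(propagate(probs, i, True))
    e_false = entropy(propagate(probs, i, False))
    return entropy(probs) - (p_i * e_true + (1 - p_i) * e_false)


def most_informative(probs, unobserved, propagate):
    """Pick the unobserved node with the maximum expected information yield."""
    return max(unobserved, key=lambda i: expected_yield(probs, i, propagate))


def toy_propagate(probs, i, value):
    """Hypothetical propagation: node 0 being TRUE also supports node 1."""
    updated = list(probs)
    updated[i] = 0.9 if value else 0.1
    if i == 0 and value:
        updated[1] = 0.9
    return updated


probs = [0.5, 0.5, 0.5]
best = most_informative(probs, [0, 1, 2], toy_propagate)
print(best)
```

Node 0 wins in this toy setting because observing it reduces the uncertainty of two nodes rather than one, which reflects the role of connectivity in the selection.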

In what follows, we examine the diagnostic performance at the level of
individual nodes. The performance is analyzed with respect to three observation
modes, which are:

(I) inferences based on the E.D. observation: nodes are given initial
probabilities (i.e., averaged weights). If a node is observed to be TRUE, it is
assigned 0.9, and 0.1 otherwise, taking into account the random error in the
original data. Inference propagation is performed based on that observed node;

(II) inferences based on random observation (as in the previous section): same
as (I) but nodes are chosen at random; and,

(III) no inference condition (or guessing): same as (II) but no inference
propagation is performed.

Since the comparison between the D-S and Certainty Factors approaches, as
presented in the preceding section, does not reveal any significant performance
difference, here we shall focus on the methods of observation with the D-S
evidential inferencing only.

5.1  Experiment E-11: Cancer Diagnosis


This section examines the performance of evidential inferences under the E.D.
observation mode. The networks to be tested in the following two experiments
are the same as the ones used in the random mode observation as described in
Section 4.2. During the validation, the inferred attributes were accepted based
upon a pair of thresholds $(u, v)$ for filtering the numeric-valued weights, and
then compared with the actual discretized attribute values in the samples.

Unlike the distribution in experiment E-5, the distributions of the
weight-based-guessing-only mode have become non-linear in the observation rate.
This indicates that the E.D. observation tends to pick the nodes with relatively
higher uncertainty. At the same time, the inferences with E.D. observation added
more information than the purely weight-based guessing with the same observation
mode, revealing that the selection of the nodes was based not purely on the
present weights of the nodes but also on their connectivities in the network.

A main result is that if the E.D. observation sampling is more than 13%, the
performance of inferencing is consistently better than that of guessing. For
instance, at 45% observation, the inferencing scheme produces 11% additional
correct predictions as compared to the pure guesses.

For evidential inferencing with two different observation modes, i.e., E.D. vs.
random, the results are significantly different. In the former observation mode,
the correctly predicted nodes at 45% observation can reach up to 87%, whereas
the latter produces around 81% given the same amount of observation. In the
random mode, at least 18% observation is required for the inferencing scheme to
show better performance. In the present E.D. mode, this percentage is further
lowered to 13%.


6  Conclusion 

In this paper, we have described a series of empirical validation experiments
which examined the performance of evidential inferences based on implication
networks that were induced by a rule learning tool (KAT). In the experiments,
building implication networks for evidential inferencing in various real-world
diagnostic task domains (as shown in the experiments, some may have weaker
implications than others) is translated into the task of statistical induction
from a small number of individual instances or empirical data samples (e.g., the
sizes of the samples for the experiments are respectively 47, 20, 40, and 153).
Generally speaking, evidential inferencing with such induced networks is effective
in generating valid predictions about unobserved events such as knowledge units
and diagnostic attribute values.

This study also explored an E.D. diagnostic method and compared its performance
with a random sampling method. The comparisons have shown that while both the
random and the minimum-entropy-based methods are desirable, the latter is in
general far better for reducing uncertainties, especially when the observation
rate is more than 13% (e.g., as shown in Experiments 7, 11, and 14).

As validated in the cancer experiments, the binary representation of diagnostic
attributes enables the induction of valid implication networks, which are
useful not only in the predictions of unobserved attributes but also in patient
diagnostic classification. The conducted experiments also reveal that the
implication network is less sensitive to the particular inferencing scheme
performed. In addition to the D-S and Certainty Factors schemes of evidential
inferencing, we have also implemented and applied other schemes such as the Bayesian
approach, with very little variation in the performance.


Abstract. This paper continues the study of machine oriented models
initiated by the second author. An attribute value is regarded as a name
of the collection (called a granule) of the entities that have the same
property (specified by the attribute value). The relational model that uses
these granules (e.g., bit representations of subsets) as attribute values is
called a machine oriented data model. The model transforms data mining,
particularly finding association rules, into Boolean operations. This paper
shows that this approach speeds up the data mining process tremendously; in
the experiments, it is approximately 50 times faster (with the pre-processing
time included).

Keywords:  data  mining,  association  rules,  Boolean  operation,  machine 
oriented  model,  granular  computing 


1  Introduction 

Data mining is the search for hidden information in stored data. Typically,
it investigates the relationship of attribute values among tuples by machine. In
other words, the meaning (to humans) of attribute values plays no role in the
processing. So we replace each meaningful attribute value by a granule (the set
of entities that have the property specified by the attribute value). Let U be
the set of entities and R the relation (representing U) under consideration.
Each granule (as a subset of the universe U) is represented by a bit pattern.
This paper is a continuation of [4], [5]. We will explain how this data model is
represented and how this model can be applied to speed up data mining. The
fundamental technique here is the computing of granules, i.e., granular
computing. In this paper, the granulation is a partition, so it is an extended
rough set theory. We will discuss the technique to implement a database using
equivalence classes (granules) and how it is applicable in mining association
rules.

2  Representations  and  Equivalence  Classes  (Granules) 

Let U be the universe of discourse, a classical set. Its elements will be
referred to as entities or objects. In relational database theory, U is the set
of entities that are represented faithfully by a relation (not a relation
scheme) [2]. Attribute values correspond to the properties of entities. We will
refer to them as elementary concepts. The collection of elementary concepts will
be denoted by C. Since the correspondence between entities and tuples is one to
one in both directions, we can also regard U as the relation (= the set of
tuples), or more precisely, the set of tuple-ids or tuple-names.

A binary relation is an equivalence relation if it satisfies the three
properties of reflexivity, symmetry, and transitivity. Values of the attribute A
will be abbreviated as A-values. The attribute A induces an equivalence relation
on U as follows: two tuples (entities) are equivalent iff the corresponding
A-values (properties) are the same. The equivalence relation, by abuse of
notation denoted by A again, partitions U into mutually disjoint equivalence
classes; we may refer to them as granules for short. Mathematically, the
attribute A induces a projection from the relation to the A-component. The
projection induces an isomorphism U/A ≅ Dom(A). One can regard each attribute
value as the "meaningful name or id" of the granule via this isomorphism. A
granule is a subset of U; at the same time, it is an element of U/A. We will
regard the latter as the canonical name of the former. The canonical name of a
granule may be represented by an explicit list of tuple-ids or by a binary
representation of the granule as a subset of U. In binary form, the value 1 or 0
at a certain position indicates whether or not the corresponding tuple holds the
particular attribute value.


Table 1. A Two Column Relation

Car Ids    Car Type
ID1        Sedan
ID2        Sport Utility
ID3        Sedan
ID4        Mini Van
ID5        Coupe
ID6        Mini Van
ID7        Sport Utility
ID8        Sedan
ID9        Sedan
ID10       Sport Utility
ID11       Station Wagon
ID12       Sedan

An example relation on vehicles, consisting of two columns, vehicle ID and
vehicle type, is used to illustrate the idea; see Table 1. The values in the ID
column uniquely identify the tuples (entities). Such a column, whose attribute
values are all unique, is not useful in data mining since the column values are
in one-to-one correspondence with the tuples themselves. But it will be used
here to reference tuples (as names) in the relation. The vehicle type, however,
is not unique. This column refers to a certain property of vehicles, a
potentially useful attribute for data mining algorithms. Table 1 contains the
following vehicles: five sedans, two mini vans, three sport utility vehicles,
one coupe, and one station wagon. The quotient set is {Sedan, Sport Utility,
Mini Van, Coupe, Station Wagon}. Each element is the meaningful name of a
granule (equivalence class). Table 2 displays the partition in two forms, list
and binary representations. Both forms represent the elements in the quotient
set (or subsets of U). Its first column contains the name of each element in the
quotient set, the second column the list representation, and the third column
the binary representation.

Note that each element of the quotient set is a subset of U. The number of
elements in the quotient set is much smaller than the number of rows; the
quotient set is the domain or, to be precise, the active domain [2]. Roughly,
we have reduced the problem from the universe to the quotient sets (the
domains); this is an essential idea in granular computing. A table with
multiple columns is then a family of partitions. In the next section, we
discuss how data mining algorithms can run faster using these binary
representations.


Table 2. Partition by Attribute Values; list and binary representations

Meaningful Names   List Representation           Binary Representation
of Granules        of Granules                   of Granules
----------------   ---------------------------   ---------------------
Sedan              = {ID1,ID3,ID8,ID9,ID12}      = 101000011001
Sport Utility      = {ID2,ID7,ID10}              = 010000100100
Mini Van           = {ID4,ID6}                   = 000101000000
Coupe              = {ID5}                       = 000010000000
Station Wagon      = {ID11}                      = 000000000010
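The construction of the partition can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `granules` and the list `CAR_TYPE` (the Car Type column of Table 1 in tuple order) are introduced only for this example.

```python
from collections import defaultdict

# Car Type column of Table 1, listed in tuple order ID1..ID12.
CAR_TYPE = [
    "Sedan", "Sport Utility", "Sedan", "Mini Van", "Coupe", "Mini Van",
    "Sport Utility", "Sedan", "Sedan", "Sport Utility", "Station Wagon", "Sedan",
]

def granules(column):
    """Partition tuple positions by attribute value, one bit string per value."""
    bits = defaultdict(lambda: ["0"] * len(column))
    for position, value in enumerate(column):
        bits[value][position] = "1"  # leftmost bit corresponds to ID1
    return {value: "".join(b) for value, b in bits.items()}

partition = granules(CAR_TYPE)
for name, bit_string in partition.items():
    print(f"{name:14s} {bit_string}")
```

Each key of `partition` is the meaningful name of a granule, and each value is its binary representation as a subset of U.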

3  Using  Equivalence  Relations  in  Data  Mining 
Algorithms 

One common data mining task is to find all association rules in a given
relation. Let X and Y be two elementary concepts (attribute values) in the
relation. The pair (X, Y) is an association rule if it appears in at least s%
of the tuples of the relation; in other words, at least s% of the tuples
contain (X, Y) as a sub-tuple.

Many papers report algorithms for discovering association rules [6], [7]. In
general, the procedure is, first, to generate possible combinations (patterns)
of attribute values; second, to count the number of times each pattern appears
in the relation; and finally, to declare the patterns that appear in at least
s% of the tuples to be association rules. Generally, these algorithms aim to
reduce the number of combinations that must be checked against the relation.
Below we demonstrate the use of granules (equivalence classes) to find
association rules. In terms of granules,




the pattern (X, Y) is an association rule if the bit count of the
intersection $X \cap Y$ is equal to or greater than s% of the number of tuples
in the relation [4]. The bit count is the number of 1's appearing in the bit
string that represents the granule as a subset of U. For simplicity, we will
use $X \cap Y$ to mean the association rule (X, Y). The pattern $X \cap Y$ can
be generalized to the intersection of multiple granules:
$X_1 \cap X_2 \cap X_3 \cap \dots \cap X_n \cap Y_1 \cap Y_2 \cap Y_3 \cap \dots \cap Y_n$.


Table 3. A Car Relation

Car Ids   Car Type        Color Type   Cost Type
-------   -------------   ----------   ---------
ID1       Sedan           Red          Moderate
ID2       Sport Utility   Blue         Expensive
ID3       Sedan           Green        Expensive
ID4       Mini Van        White        Moderate
ID5       Coupe           Red          Cheap
ID6       Mini Van        Red          Expensive
ID7       Sport Utility   Black        Moderate
ID8       Sedan           Blue         Expensive
ID9       Sedan           Green        Expensive
ID10      Sport Utility   Black        Moderate
ID11      Station Wagon   Green        Moderate
ID12      Sedan           Blue         Expensive

Table 3 extends the previous vehicle relation with two more columns, Color
Type and Cost Type. Each column has its own domain of attribute values as well
as its own granules. The quotient sets for the last two columns are displayed
in Figure 1.

* Color Type Quotient Set

1. Red = {ID1, ID5, ID6}

2. Blue = {ID2, ID8, ID12}

3. Green = {ID3, ID9, ID11}

4. White = {ID4}

5. Black = {ID7, ID10}

* Cost Type Quotient Set

1. Cheap = {ID5}

2. Moderate = {ID1, ID4, ID7, ID10, ID11}

3. Expensive = {ID2, ID3, ID6, ID8, ID9, ID12}

Figure 1. Quotient Sets

From  this  small  Table  3,  one  could  conjecture  that  almost  all  mini  vans  are  white, 
or 66% of the expensive vehicles are sedans. Or one could ask what type of
expensive vehicle makes up more than 10% of the data. To answer this last
question, each granule of the car type quotient set must be checked against
the granule Expensive from the cost type quotient set, to determine which of
these combinations (sub-tuples of the Cartesian product of quotient sets) are
association rules. Each combination requires a Boolean AND operation between
the two granules. There are five combinations. Table 4 shows the operations on
those five combinations, the resulting intersections, and the bit counts of
the results.

Table 4. AND operation of granules

Combination                     Binary Form                             Result           Bit Count
-----------------------------   -------------------------------------   --------------   ---------
Sport utility AND expensive     {010000100100} $\cap$ {011001011001}    {010000000000}   1
Mini van AND expensive          {000101000000} $\cap$ {011001011001}    {000001000000}   1
Coupe AND expensive             {000010000000} $\cap$ {011001011001}    {000000000000}   0
Station wagon AND expensive     {000000000010} $\cap$ {011001011001}    {000000000000}   0
Sedan AND expensive             {101000011001} $\cap$ {011001011001}    {001000011001}   4

The combination Sedan AND Expensive is the only one that meets or exceeds 10%
of the rows in the vehicle table (4 of 12 rows). So there is only one
association rule among those five combinations. The number of combinations to
check is small compared to the more general question of finding all
combinations of car type and color type that represent 10% of the data. For
that case, there are 25 combinations to check (5 car types times 5 color
types).
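The check carried out in Table 4 can be reproduced with a short Python sketch. The bit strings are those of Tables 2 and 4; the helper names `intersection_count` and `large_pairs` are assumptions made for this illustration and do not come from the paper.

```python
# Bit strings from Tables 2 and 4; the leftmost bit corresponds to ID1.
CAR_TYPE_GRANULES = {
    "Sport Utility": "010000100100",
    "Mini Van": "000101000000",
    "Coupe": "000010000000",
    "Station Wagon": "000000000010",
    "Sedan": "101000011001",
}
EXPENSIVE = "011001011001"
N_ROWS = 12

def intersection_count(a, b):
    """Bit count of the AND of two granules given as bit strings."""
    return bin(int(a, 2) & int(b, 2)).count("1")

def large_pairs(granules, other, n_rows, s_percent):
    """Names of granules whose intersection with `other` covers at least s% of rows."""
    return [name for name, bits in granules.items()
            if intersection_count(bits, other) * 100 >= s_percent * n_rows]

counts = {name: intersection_count(bits, EXPENSIVE)
          for name, bits in CAR_TYPE_GRANULES.items()}
rules = large_pairs(CAR_TYPE_GRANULES, EXPENSIVE, N_ROWS, 10)
print(counts)
print(rules)
```

The comparison is done in integers (`count * 100 >= s * n_rows`) to avoid rounding issues when the threshold is a percentage.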

Determining association rules can be computationally expensive when there are
many quotient sets to consider. For two columns, the combinations are the
pairs (2-tuples) of the Cartesian product of the two quotient sets,
$Q_1 \times Q_2$. The number of combinations is the product
$\mathrm{Card}(Q_1)\,\mathrm{Card}(Q_2)$ of the two cardinal numbers. In
general, the number of combinations is huge if the question touches on many
columns (quotient sets) and each column has many elements. Below, we explain
methods that use granules to lessen the number of possible combinations and to
reduce the computation needed to determine whether a particular combination is
an association rule.

4  The  Algorithms  and  Comparisons 

Let $A_i$, $i = 1, 2, \dots$, be the attributes of a relation R. Each column
$A_i$ gives rise to a quotient set $Q_i = U/A_i$. Each tuple in R induces a
tuple in the Cartesian product $Q_1 \times Q_2 \times \dots$. For simplicity
and clarity, the term tuple will be reserved for the relation R, and the
induced tuple and its sub-tuples in the Cartesian product of quotient sets
will be referred to as combinations of granules from $Q_i$, $i = 1, 2, \dots$.
A combination of length q is called a q-combination. A granule is said to be
large if it meets the s% threshold, that is, if the ratio of its cardinality
to $\mathrm{Card}(U)$ is at least s%. A q-combination (of granules) is said to
be large if the intersection of the granules in the combination is large.

- A large q-combination forms an association rule (of length q).

The task in this section is to develop efficient algorithms to determine which
combinations are association rules. The task involves two issues. The first is
how to select the combinations, which we treat in Section 4.1. The second is
how to evaluate whether a combination is large; this is treated in Section
4.2. We consider both together in Section 4.3.

4.1  The  Outer  Loop:  Generating  Potential  Combinations  to  Test 
for  Association  Rules 

One obvious way to reduce the complexity is to remove those granules in a
quotient set $Q_n$ that do not meet the percentage s%. These granules will be
called lean granules. Clearly, removing a lean granule L also removes every
combination that includes L. With that in mind, every generated large
combination must have all of its sub-combinations large too.

- q-combinations should be generated from large (q - 1)-combinations.

- All sub-(q - 1)-combinations of a large q-combination are large.

This step is essentially the same as in the Apriori algorithm.
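The two rules above can be sketched as an Apriori-style join-and-prune step. This is a minimal sketch under assumptions of our own: a granule is represented as a hypothetical `(column, value)` pair, a combination as a `frozenset` of such pairs, and the function name `generate_candidates` is illustrative. Two granules of the same column are never combined, since distinct equivalence classes of one partition do not intersect.

```python
from itertools import combinations

def generate_candidates(large_prev):
    """Build q-combinations from large (q - 1)-combinations.

    large_prev: set of frozensets of (column, value) pairs, all of size q - 1.
    """
    candidates = set()
    for a, b in combinations(large_prev, 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue  # the pair does not differ in exactly one granule
        columns = [column for column, _ in union]
        if len(set(columns)) != len(columns):
            continue  # two granules of one column have an empty intersection
        # Prune: every (q - 1)-sub-combination must itself be large.
        if all(frozenset(sub) in large_prev
               for sub in combinations(union, len(a))):
            candidates.add(union)
    return candidates

# Hypothetical large 1-combinations taken from the car relation.
large_1 = {
    frozenset({("Car Type", "Sedan")}),
    frozenset({("Car Type", "Sport Utility")}),
    frozenset({("Cost Type", "Expensive")}),
}
candidates_2 = generate_candidates(large_1)
```

Only the surviving candidates need to be passed on to the bit-counting step.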

4.2 The Inner Loop: The Computation Cost of Determining Whether a
Generated Combination Is an Association Rule

In this section, we explain how to determine whether a q-combination is large,
and we compare our approach with Apriori.


Counting bits. We explain the logarithmic approach, which counts the bits by
partitioning the word. For each word, it counts within every 2 bits, then
within every 4 bits, next within every 8 bits, then within every 16 bits, and
finally over all 32 bits. Here is the sample (we use pseudo C syntax).

1. 2 bits count: odd bits are added to the preceding even bits. Let B be the
bit pattern and Count32 a variable holding the "new" bit pattern.

Count32 = (B & 0x55555555) + ((B >> 1) & 0x55555555);

2. 4 bits count: odd 2-bits are added to even 2-bits (each 2-bit represents the
sum of even and odd bits)

Count32 = (Count32 & 0x33333333) + ((Count32 >> 2) & 0x33333333);

3. 8 bits count: odd 4-bits are added to even 4-bits (each 4-bit represents the
sum of even and odd 2-bits)

Count32 = (Count32 & 0x0F0F0F0F) + ((Count32 >> 4) & 0x0F0F0F0F);

4. 16 bits count: odd 8-bits are added to even 8-bits

Count32 = (Count32 & 0x00FF00FF) + ((Count32 >> 8) & 0x00FF00FF);

5. 32 bits count: odd 16-bits are added to the even 16-bits

TotalCount = (Count32 & 0x0000FFFF) + (Count32 >> 16);

The first four statements take 28 instructions, 4 * (3 move + 2 and + 1 add
+ 1 shr), and the last statement takes 7 instructions (3 move + 1 and + 2 add
+ 1 shr). The total for this approach is 35 instructions to count one 32-bit
word. This can be faster than the usual shift method by an order of magnitude.
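The five statements above translate directly into Python, which makes it easy to check them against a straightforward bit count. This is a sketch for verification only; the function name `popcount32` is our own, and the input is assumed to be a non-negative integer below $2^{32}$.

```python
def popcount32(word):
    """Count the 1 bits of a 32-bit word in five mask-and-add steps."""
    count = (word & 0x55555555) + ((word >> 1) & 0x55555555)    # sums per 2 bits
    count = (count & 0x33333333) + ((count >> 2) & 0x33333333)  # sums per 4 bits
    count = (count & 0x0F0F0F0F) + ((count >> 4) & 0x0F0F0F0F)  # sums per 8 bits
    count = (count & 0x00FF00FF) + ((count >> 8) & 0x00FF00FF)  # sums per 16 bits
    return (count & 0x0000FFFF) + (count >> 16)                 # sum over 32 bits

# Compare with the naive count on a few sample words.
samples = [0, 1, 0x80000000, 0xFFFFFFFF, 0x12345678, 0xA0A0A0A0]
for sample in samples:
    assert popcount32(sample) == bin(sample).count("1")
```

The number of steps grows with the logarithm of the word size, which is why the method is called logarithmic.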


The Instruction Counts. The instruction count for deciding whether a
q-combination is large is simply the number of AND instructions between
elements of quotient sets in binary form plus the instructions needed to count
the number of 1's in the result. Since computers process one word at a time,
the number of instructions is approximately the following:

1. Set length = the length of the bit string representing a granule. This is
the cardinality of the universe, $\mathrm{Card}(U)$.

2. Set w = the number of bits in a word; in this paper, w is 32.

3. Set count = (length/w) * ((q - 1) AND operations + cost of counting bits).

4. The total instruction count to determine, in the worst case, whether a
q-combination is large is the following:

count = (length/w) * ((q - 1) + 35) instructions.

For example, for a combination of length 10 with $2^{20}$ tuples (a "million"
tuples) in the relation, 1,441,792 instructions will be executed:

$(2^{20}/32) \cdot ((10 - 1) + 35) = 2^{15} \cdot 44 = 1{,}441{,}792$ instructions.

Note: We may stop once the count meets s% of the data.
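The worst-case estimate can be written as a small helper to try other relation sizes and combination lengths. The function name `instruction_estimate` and its default arguments are assumptions for illustration; the defaults are the word size of 32 bits and the 35-instruction counting cost stated above.

```python
def instruction_estimate(n_tuples, q, word_bits=32, count_cost=35):
    """Worst-case instruction count for testing one q-combination."""
    words = -(-n_tuples // word_bits)  # ceiling of n_tuples / word_bits
    return words * ((q - 1) + count_cost)

# The example from the text: 2**20 tuples, combination of length 10.
example = instruction_estimate(2**20, 10)
print(example)
```

For relation sizes that are not a multiple of the word size, the last partially filled word is counted as a full word.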


4.3  Inner  and  Outer  Loop:  Grouping  Combinations 

As noted in Section 4.2, there are two strategies for carrying out the
computations: one is the full computation of the intersection; the other stops
once the s% condition is met.

1. For q = 1, the inner loop identifies the large granules, say
$l_1, l_2, \dots, l_n$, by counting the bits using the method of Section 4.2;
note that the counting stops at s%. For uniformity in terminology, they will
be referred to as large 1-combinations.

2. For q = 2, the inner loop computes each 2-combination by ANDing two
1-combinations and counting the bits of the result, say $l_1 \cap l_2$. There
are three approaches to this computation:

a) full computation: We compute the intersection and save the result. This
requires memory and storage management, since the intersections may not all
fit in main memory.

b) partial computation: The computation of the intersection stops as soon as
the s% threshold is reached. No intersections of bit patterns are saved. The
disadvantage is that some computations may be repeated. But when the support
count is much smaller than the length of the bit strings, this is the right
approach.

c) partial computation and partial save: Compute up to s% and save the partial
intersections. This could delay the storage management to a much later stage,
so one might be able to avoid it altogether for databases that are not
extremely large.

3. For general q, the loop computes each q-combination by ANDing two
(q - 1)-combinations and counting the bits of the result. Again, we can use
one of the three approaches.
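Approach b), partial computation, can be sketched as follows. The sketch assumes that each granule is stored as a list of 32-bit words and that the s% threshold has already been converted to a tuple count `min_count`; the function name `is_large` and the returned word counter are our own additions for illustration.

```python
def is_large(bitmaps, min_count):
    """Return (large, words_read) for the intersection of the given granules.

    bitmaps: equal-length lists of 32-bit words, one list per granule.
    min_count: number of tuples corresponding to the s% threshold.
    """
    total = 0
    words_read = 0
    for words in zip(*bitmaps):
        acc = words[0]
        for word in words[1:]:
            acc &= word                # q - 1 AND operations per word position
        total += bin(acc).count("1")   # bit count of the intersected word
        words_read += 1
        if total >= min_count:         # early stop: threshold already reached
            return True, words_read
    return False, words_read

# Two hypothetical granules of three words each.
a = [0xFFFFFFFF, 0x00000000, 0xFFFFFFFF]
b = [0x0000FFFF, 0xFFFFFFFF, 0x000000FF]
print(is_large([a, b], 16))
print(is_large([a, b], 25))
```

No intersection is stored, so the same words may be ANDed again when a longer combination is tested, which is the repeated work mentioned in b).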

5  Computational  Data 

The relation consists of 128K (131,072) rows and 16 columns, the required
support is 8,192 rows, and the available memory is 10 megabytes:


Table  5.  Experimental  Results 


Length  of 
combi¬ 
nation 

#  of 

Candidates 

Association 

rules 

Granule(Full 

Computation 

Granule 

Partial 

Apriori 

Hybrid 

Apriori 

199 

Apriori 

Tid 

1 

101 

1.110s 

1.110s 

3.813s 

2 

381 

26.078s 

25.765s 

mm 

3 

1419 

246 

46.609s 

37.938s 

362.657s 

4 

137 

39 

0.141s 

0.141s 

30.187s 

26.312s 

5 

1 

1 

0.000s 

0.000s 

0.343s 

74.094s 

0.360s 

6 

0 

0 

0.000s 

0.000s 

0.000s 

0.000s 

0.000s 

4.938 

4.828s 

100.078s 

169.094s 

429.813 

The programs for Apriori, AprioriTid, and AprioriHybrid are our own faithful
implementations of the algorithms in [6], [7]. In the implementations, we use
a buffering scheme to speed up reads and writes for all algorithms.

Comparison with Apriori. In the outer loop, Apriori and the granular approach
are the same. In the inner loop, Apriori goes through each transaction and uses
a subset function to verify whether a q-combination belongs to that
transaction (for a fixed q and a fixed combination). It takes more than one
instruction to hash through the hash tree. The granular approach involves only
AND operations, computed q times, and each AND handles 32 tuples at once
(32 = word size), so it takes only about q/32 instruction(s) per tuple. So our
computation is much faster. This is from the computational point of view. From
the I/O point of view, the machine model often transforms the database into a
more compact form, so reading the whole database takes less time too. We
should point out that if attribute values are continuous or take too many
values, one has to perform discretization, partitioning, or granulation on the
(active) attribute domain first. The so-called concept hierarchy method can be
applied [1].

6  Conclusions 

It is clear why the granular approach is faster than Apriori in our
experiments. It is less clear why it is still faster than AprioriTid and
AprioriHybrid; it may be due to the fact that the "Tid" and "Hybrid"
algorithms only improve on Apriori in the later phases, and those phases do
not play the decisive role. Further analysis and experiments are necessary.

Abstract. A set of association rules is called representative if it is a
minimal  set  of  rules  from  which  all  association  rules  can  be  generated. 
The  existing  algorithms  for  generating  representative  association  rules 
use  all  the  frequent  itemsets  as  input.  In  this  paper,  we  present  a  new 
approach  for  generating  representative  association  rules  that  uses  only 
a  subset  of  the  set  of  frequent  itemsets  called  frequent  closed  itemsets. 
This  results  in  a  big  reduction  in  the  input  size  and,  therefore,  faster 
algorithms  for  generating  representative  association  rules.  Our  approach 
uses  ideas  from  formal  concept  analysis  to  find  frequent  closed  itemsets. 


1  Introduction 

Mining association rules is an important data mining problem that was
introduced in [1]. The problem was first defined in the context of market
basket data to identify customers' buying habits. For example, it is of
interest to a supermarket manager to find out that 80% of the customers who
buy bagels also buy cream-cheese and that 5% of all customers buy both bagels
and cream-cheese. Here the association rule is bagels $\Rightarrow$
cream-cheese, 80% is the confidence of the rule, and 5% is its support. There
has been a great deal of research on developing efficient algorithms for
discovering association rules that satisfy user-specified constraints such as
minimum support and minimum confidence [2,3,12,14].

The number of discovered association rules is usually huge, which makes it
very difficult for an expert to analyse the rules and identify the interesting
ones. Lately, there has been an interest in identifying the association rules
that are of special importance to a user and in decreasing the number of
discovered association rules [4,9,10,13]. Most of these approaches introduce
additional measures

*  This  research  was  supported  in  part  by  the  Army  Research  Office,  Grant  No. 
DAAH04-96-1-0325,  under  DEPSCoR  program  of  Advanced  Research  Projects 
Agency,  Department  of  Defense  and  by  the  U.S.  Department  of  Energy,  Grant  No. 
DE-FG02-97ER1220.

Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 495-504, 2000.

for the interestingness of a rule and prune the rules that do not satisfy the
additional measures. A set of representative association rules, on the other
hand, is a minimal set of rules from which all association rules can be
generated. The number of representative association rules is much smaller than
the number of all association rules. Furthermore, we do not need any
additional measure for determining the representative association rules.

Algorithms for discovering representative association rules are given in
[7,8]. These algorithms use all the frequent itemsets to find the
representative association rules. In this paper, we present a different
approach for generating representative association rules. Our approach uses
only a subset of the set of frequent itemsets, which we call frequent closed
itemsets. This reduces the input size and therefore yields faster algorithms
for generating representative association rules. We use ideas from formal
concept analysis to find the frequent closed itemsets.

2 Association Rules and Representative Association Rules

The problem of discovering association rules was first introduced in [1]. It
can be described formally as follows [1,2]. Let
$\mathcal{I} = \{i_1, i_2, \dots, i_m\}$ be a set of m literals, called items.
Let $\mathcal{D} = \{t_1, t_2, \dots, t_n\}$ be a database of n transactions,
where each transaction is a subset of $\mathcal{I}$. Any subset of items X is
called a k-itemset if the number of items in X equals k. The support of an
itemset X, denoted by sup(X), is the percentage of transactions in the
database $\mathcal{D}$ that contain X. An itemset is called frequent if its
support is greater than or equal to a user-specified threshold value.

An association rule r is a rule of the form $X \Rightarrow Y$ where both X
and Y are nonempty subsets of $\mathcal{I}$ and $X \cap Y = \emptyset$. X is
called the antecedent of r and Y is called its consequent. The support and
confidence of the association rule $r : X \Rightarrow Y$ are denoted by sup(r)
and conf(r), respectively, and defined as

$\mathrm{sup}(r) = \mathrm{sup}(X \cup Y)$ and $\mathrm{conf}(r) = \mathrm{sup}(X \cup Y)/\mathrm{sup}(X)$.

The support of $r : X \Rightarrow Y$ is simply a measure of its statistical
significance, and the confidence of r is a measure of the conditional
probability that a transaction contains Y given that it contains X.
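The two definitions can be illustrated with a short Python sketch. The four-transaction database below is hypothetical and chosen only for the example, and the function names `support` and `confidence` are our own; support is returned as a fraction rather than a percentage.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(X => Y) = sup(X u Y) / sup(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical market basket data.
baskets = [
    {"bagels", "cream-cheese"},
    {"bagels", "cream-cheese", "milk"},
    {"bagels"},
    {"milk"},
]
rule_support = support({"bagels", "cream-cheese"}, baskets)
rule_confidence = confidence({"bagels"}, {"cream-cheese"}, baskets)
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both items, and two of the three baskets with bagels also contain cream-cheese.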

The task of the association data mining problem is to find all association
rules with support and confidence greater than or equal to user-specified
minimum support and minimum confidence threshold values. Throughout this
paper, we will use the notation AR(s,c) to denote the set of all association
rules with minimum support s and minimum confidence c. We also write AR
instead of AR(s,c) when s and c are understood.

The number of association rules is usually huge. Representative association
rules (RAR) were introduced in [7] to overcome this problem and to reduce the
number of rules presented to a user. The user can mine around the RAR. For
example, the user may ask to be presented with all the rules that are covered
(or represented) by a certain rule of interest to him/her. Informally, the
cover of a rule $r : X \Rightarrow Y$, denoted by C(r), is the set of
association rules that can be generated from r. Formally,

$C(r : X \Rightarrow Y) = \{X \cup U \Rightarrow V \mid U, V \subseteq Y,\ U \cap V = \emptyset,\ \text{and}\ V \neq \emptyset\}$.

An important property of the cover operator is that if an association rule r
has support s and confidence c, then every rule $r' \in C(r)$ has support at
least s and confidence at least c [7]. This property means that C is a
well-defined inference operator for association rules.
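The cover operator can be enumerated directly from its definition. The following is a brute-force sketch for small rules, with rules represented as hypothetical `(antecedent, consequent)` pairs of frozensets; the names `subsets` and `cover` are assumptions for illustration.

```python
from itertools import combinations

def subsets(items):
    """Yield all subsets of `items` as frozensets, including the empty set."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def cover(antecedent, consequent):
    """C(X => Y): all rules X u U => V with U, V subsets of Y, disjoint, V nonempty."""
    x, y = frozenset(antecedent), frozenset(consequent)
    return {(x | u, v) for u in subsets(y) for v in subsets(y - u) if v}

# Cover of the rule A => BC.
rules = cover("A", "BC")
for lhs, rhs in sorted(rules, key=lambda r: (len(r[0]), sorted(r[0]), sorted(r[1]))):
    print("".join(sorted(lhs)), "=>", "".join(sorted(rhs)))
```

For a consequent with m items the cover contains $3^m - 2^m$ rules, since each item of Y goes to U, to V, or to neither, and V must not be empty; the rule r itself is always a member of C(r).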

Using the cover operator, a set of representative association rules with
minimum support s and minimum confidence c, RAR(s,c), is defined as follows:

$RAR(s,c) = \{r \in AR(s,c) \mid \nexists r' \in AR(s,c),\ r \neq r' \text{ and } r \in C(r')\}$.

That is, a set of representative association rules is a least set of
association rules that cover all the association rules and from which all
association rules can be generated. Clearly,
$AR(s,c) = \bigcup \{C(r) \mid r \in RAR(s,c)\}$.

Let the length of $X \Rightarrow Y$ be the number of items in $X \cup Y$. The
following are important properties of RAR [7,8]:

Property 1. Let $r : X \Rightarrow Y$ and $r' : X' \Rightarrow Y'$ be two
different association rules. Then

1. If r is longer than r', then $r \notin C(r')$.

2. If r is shorter than r', then $r \in C(r')$ iff $X \cup Y \subset X' \cup Y'$ and $X \supseteq X'$.

3. If r and r' are of the same length, then $r \in C(r')$ iff $X \cup Y = X' \cup Y'$ and
$X \supseteq X'$.

Property 2. Let $r : X \Rightarrow Z \setminus X \in AR(s,c)$ and let
$maxSup = \max(\{\mathrm{sup}(Z') \mid Z \subset Z' \subseteq \mathcal{I}\} \cup \{0\})$.
Then $r \in RAR(s,c)$ if the following two conditions are satisfied:

i. $maxSup < s$ or $maxSup/\mathrm{sup}(X) < c$.
ii. $\nexists X'$, $\emptyset \subset X' \subset X$, such that $X' \Rightarrow Z \setminus X' \in AR(s,c)$.

The first condition guarantees that r does not belong to the cover of any
association rule with length greater than the length of r. The second
condition guarantees that r does not belong to the cover of any association
rule that has the same length as r.

Property 3. Let $\emptyset \neq X \subset Z \subset Z' \subseteq \mathcal{I}$ and
$\mathrm{sup}(Z) = \mathrm{sup}(Z')$. Then there is no rule
$r : X \Rightarrow Z \setminus X \in AR(s,c)$ such that $r \in RAR(s,c)$.

Property 3 holds because $r \in C(X \Rightarrow Z' \setminus X)$. The above
properties led to the development of the algorithms GenAllRepresentatives and
FastGenAllRepresentatives for discovering representative association rules
[7,8]. Both algorithms use all frequent itemsets generated by applying the
Apriori algorithm to the database $\mathcal{D}$ [2]. Our approach is
different; we use only a subset of the frequent itemsets, which we call
frequent closed itemsets. Frequent closed itemsets are found by using methods
from formal concept analysis. In the next section we develop the necessary
theory for that.

3 Closed Itemsets

In  this  section,  we  develop  theoretical  results  that  lead  to  the  development  of 
our  algorithm  for  RAR  generation.  These  results  are  directly  related  to  formal 
concept  analysis  [15].  Our  notion  of  a  closed  itemset  is  similar  to  that  of  a 
concept.  Informally,  a  concept  is  a  pair  of  two  sets:  set  of  objects  (transactions 
or  itemsets)  and  set  of  features  (items)  common  to  all  the  objects.  Using  the 
framework  of  formal  concept  analysis,  concepts  are  structured  in  the  form  of  a 
lattice  called  the  concept  lattice.  The  concept  lattice  has  proved  to  be  a  useful 
tool  for  knowledge  representation  and  knowledge  discovery  [6]. 

Definition 1. A data mining context is defined as a triple
$(T, \mathcal{I}, R)$ where T is a set of transactions, $\mathcal{I}$ is a set
of items and $R \subseteq T \times \mathcal{I}$.

A data mining context is a formal definition of a database. The set T is the
set of all transactions in the database and the set $\mathcal{I}$ is the set
of all items in the database. For $t \in T$ and $i \in \mathcal{I}$ we write
$(t, i) \in R$ to mean that the transaction t contains the item i. An example
of a data mining context is shown in Table 1, where an X is placed in the t-th
row and i-th column to indicate that $(t, i) \in R$. This example is generated
from the database given in [7], which we use in this paper for comparison.


Table 1. Example of a Data Mining Context

Two dual mappings $\alpha$ and $\beta$ are defined between the power sets of T
and $\mathcal{I}$ as follows.

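Definition 2. Let $(T, \mathcal{I}, R)$ be a data mining context,
$X \subseteq T$, and $Y \subseteq \mathcal{I}$. Define the mappings $\alpha$,
$\beta$ as follows:

$\beta : 2^T \to 2^{\mathcal{I}}, \quad \beta(X) = \{i \in \mathcal{I} \mid (t, i) \in R \ \ \forall t \in X\}$,

$\alpha : 2^{\mathcal{I}} \to 2^T, \quad \alpha(Y) = \{t \in T \mid (t, i) \in R \ \ \forall i \in Y\}$.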

The mapping $\beta(X)$ associates with X the set of items that are common to
all the transactions in X. Similarly, the mapping $\alpha(Y)$ associates with
Y the set of all transactions having all the items in Y. Intuitively,
$\beta(X)$ is the maximum set of items shared by all transactions in X and
$\alpha(Y)$ is the maximum set of transactions possessing all the items in Y.

Example 1. Consider the database presented in Table 1. Let
$X = \{t_1, t_2\}$ and $Y = \{A, B, C\}$. Then $\beta(X) = \{A,B,C,D,E\}$,
$\alpha(Y) = \{t_1, t_2, t_3\}$,
$\alpha(\beta(X)) = \alpha(\{A,B,C,D,E\}) = \{t_1, t_2, t_3\}$, and
$\beta(\alpha(Y)) = \beta(\{t_1, t_2, t_3\}) = \{A,B,C,D,E\}$.

It is clear from this example that, in general, $\beta(\alpha(Y)) \neq Y$
where Y is an itemset. This leads to the following definition.

Definition 3. An itemset Y that satisfies the condition
$\beta(\alpha(Y)) = Y$ is called a closed itemset.

Closed itemsets are important because all members of the concept lattice of
a data mining context satisfy the condition $\beta(\alpha(Y)) = Y$ [15]. This
step of considering only closed itemsets can be considered as a first step in
pruning the itemset lattice.

Example 2. Let $Y = \{A, B, C, D, E\}$. Then $\beta(\alpha(Y)) = Y$.
Therefore, Y is a closed itemset. On the other hand, the itemset
$\{A, B, C\}$ given in Example 1 is not closed because
$\beta(\alpha(\{A,B,C\})) = \{A,B,C,D,E\} \neq \{A,B,C\}$.
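The mappings of Definition 2 and the closedness test of Definition 3 can be sketched as follows. The five-transaction context used here is hypothetical: it was chosen only so that it reproduces the values stated in Examples 1 and 2, and it should not be read as the content of Table 1. The names `CONTEXT`, `ALL_ITEMS`, `alpha`, `beta`, and `is_closed` are assumptions for illustration.

```python
# Hypothetical context: transaction id -> set of items it contains.
CONTEXT = {
    "t1": set("ABCDE"),
    "t2": set("ABCDEF"),
    "t3": set("ABCDEHI"),
    "t4": set("ABE"),
    "t5": set("BCDEHI"),
}
ALL_ITEMS = set().union(*CONTEXT.values())

def alpha(items, context=CONTEXT):
    """Transactions that contain every item in `items`."""
    return {t for t, contents in context.items() if set(items) <= contents}

def beta(transactions, context=CONTEXT, all_items=ALL_ITEMS):
    """Items common to every transaction in `transactions`."""
    common = set(all_items)  # beta of the empty set is the whole item set
    for t in transactions:
        common &= context[t]
    return common

def is_closed(items):
    """An itemset Y is closed when beta(alpha(Y)) == Y."""
    return beta(alpha(items)) == set(items)

print(sorted(beta({"t1", "t2"})), sorted(alpha("ABC")))
print(is_closed("ABCDE"), is_closed("ABC"))
```

The composition `beta(alpha(Y))` returns the smallest closed itemset containing Y, which is why comparing it with Y itself decides closedness.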

The concept lattice can be pruned further by considering only closed itemsets
with support greater than or equal to the minimum support, which we call
frequent closed itemsets. This leads to the following definition.

Definition 4. A frequent closed itemset is a closed itemset which is also
frequent. That is, its support is greater than or equal to the user-specified
value for minimum support.

4  Algorithms 

Our  approach  is  to  first  generate  the  set  of  all  frequent  closed  itemsets,  FCI, 
and  then  to  use  FCI  to  generate  the  set  of  all  representative  association  rules. 

4.1  Generating  Frequent  Closed  Itemsets 

Let $\mathcal{D} = (T, \mathcal{I}, R)$ be a data mining context. The
algorithm we use to generate FCI is a slight modification of the Close
algorithm mentioned in [11], which we call Close-FCI. Both algorithms are
similar to the Apriori algorithm [2].

Assume that the items in $\mathcal{I}$ are sorted in lexicographic order. The
data structure used consists of two sets: the set of candidate frequent closed
itemsets, FCC, and the set of frequent closed itemsets, FC. The notations
$FCC_i$ and $FC_i$ are used to indicate candidate frequent closed itemsets and
frequent closed itemsets of size i, respectively. Each element in $FCC_i$ and
$FC_i$ has three components: an itemset component, a closure component, and a
support component.

The closure of an itemset $X \subseteq \mathcal{I}$, denoted by closure(X), is
the smallest closed itemset containing X and is equal to the intersection of
all transactions containing X. It is also shown in [11] that
Support(X) = Support(Closure(X)). The Close-FCI algorithm is given below.

Algorithm 1 Close-FCI($\mathcal{D}$)

1) $FCC_1$.itemsets = {1-itemsets};

2) for (k = 1; $FCC_k \neq \emptyset$; k++) do begin

3)    forall $X \in FCC_k$ do begin

4)       X.closure = $\emptyset$;

5)       X.support = 0;

6)    end forall

7)    $FCC_k$ = Generate-Closures($FCC_k$);

8)    forall candidate closed itemset $X \in FCC_k$ do begin

9)       if ((X.support $\geq$ minSupport) and (X.closure $\notin FC_k$)) then

10)         $FC_k \leftarrow FC_k \cup \{X\}$;

11)      end if

12)   $FCC_{k+1}$ = Generate-Candidates($FC_k$);

13) endfor

14) return $\bigcup_{i=1}^{k-1}$ {$FC_i$.closure and $FC_i$.support};

First, the algorithm initializes the itemsets in $FCC_1$ to the items in the
database, which does not require any database pass. Then, in iteration k of
the main for loop, the closure and support of each itemset in $FCC_k$ are
initialized. The algorithm then finds the candidate frequent closed itemsets
of size k, $FCC_k$, in line 7. The frequent closed itemsets, $FC_k$, are found
in the next step, where the minimum support threshold, minSupport, is used to
prune out the infrequent itemsets. Finally, in line 12 the algorithm generates
the candidate frequent closed itemsets of size k + 1, $FCC_{k+1}$.

The closure of an itemset $X \subseteq I$ is, by definition, the smallest closed
itemset containing $X$, which is equal to the intersection of all transactions
containing $X$. Therefore, closure(X) is found using the following formula

\[ \mathrm{closure}(X) = \bigcap_{t \in T} \{\beta(t) \mid X \subseteq \beta(t)\} \]

which is incorporated in the following algorithm. This algorithm requires one
database pass to find the closures of all elements in $FCC_k$.

Algorithm 2  Generate-Closures(FCC_k)

1)  forall transactions t ∈ T do begin
2)      β_t = {X ∈ FCC_k | X ⊆ β(t)};   // all itemsets that are contained in β(t)
3)      forall itemsets X ∈ β_t do begin
4)          if (X.closure = ∅) then
5)              X.closure ← β(t);
6)          else
7)              X.closure ← X.closure ∩ β(t);
8)          end if
9)          X.support++;
10)     end forall
11) end forall
12) Return ∪{X ∈ FCC_k | X.closure ≠ ∅};
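The single database pass of Algorithm 2 can be sketched in Python as follows. This is only an illustrative sketch: it assumes that β(t) is the set of items of transaction t, it uses `None` rather than the empty set as the "closure not yet seen" marker, and the function name `generate_closures` and the five-transaction database `DB` are hypothetical (the database is not the paper's Table 1).

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

Itemset = FrozenSet[str]


def generate_closures(
    candidates: Iterable[Itemset],
    transactions: List[Itemset],
) -> Dict[Itemset, Tuple[Itemset, int]]:
    """One pass over the transactions; returns {X: (closure(X), support(X))}."""
    closure: Dict[Itemset, Optional[Itemset]] = {x: None for x in candidates}
    support: Dict[Itemset, int] = {x: 0 for x in closure}

    for t in transactions:              # line 1: the only database pass
        for x in closure:               # lines 2-3: candidates contained in t
            if x <= t:
                # lines 4-7: the first hit copies t, later hits intersect with t
                closure[x] = t if closure[x] is None else closure[x] & t
                support[x] += 1         # line 9
    # line 12: keep only the candidates that occurred in some transaction
    return {x: (closure[x], support[x]) for x in closure if closure[x] is not None}


# Hypothetical database, chosen only for illustration.
DB = [frozenset(s) for s in ("ABCDE", "ABCDE", "ABCDE", "BCDE", "ABE")]
result = generate_closures([frozenset(i) for i in "ABCDE"], DB)
for item in "ABCDE":
    c, s = result[frozenset(item)]
    print(item, "".join(sorted(c)), s)
```

On this database the closure of A is ABE (support 4), while C and D share the closure BCDE, which is why only one of them needs to be kept as a generator.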
The algorithm Generate-Candidates(FC_k), which finds candidate closed itemsets
of size $k + 1$, uses the following property [11].

Property 1. Let $X$ be an itemset of length $k$ and let
$\mathcal{X} = \{X_1, X_2, \ldots, X_m\}$ be a set of $(k-1)$-subsets of $X$,
where $\bigcup_{X_i \in \mathcal{X}} X_i = X$. If $\exists X_i \in \mathcal{X}$
such that $X \subseteq \mathrm{closure}(X_i)$, then
$\mathrm{closure}(X) = \mathrm{closure}(X_i)$.

This property means that the itemset $X$ results in redundant computation of
frequent closed itemsets, because closure(X), which is equal to closure(X_i),
has already been generated. Therefore, $X$ can be removed from $FCC_{k+1}$.
Generate-Candidates(FC_k) is similar to the AprioriGen algorithm, first given
in [2], except for the second pruning step at line 6.

Algorithm 3  Generate-Candidates(FC_k)

1)  insert into FCC_{k+1}.itemset
2)      select X.itemset_1, X.itemset_2, ..., X.itemset_k, Y.itemset_k
3)      from FC_k.itemset X, FC_k.itemset Y
4)      where X.itemset_1 = Y.itemset_1 ∧ X.itemset_2 = Y.itemset_2 ∧ ... ∧
              X.itemset_{k-1} = Y.itemset_{k-1} ∧ X.itemset_k < Y.itemset_k;
    // prune all supersets of infrequent itemsets
5)  delete all itemsets X ∈ FCC_{k+1}.itemset where some k-subset of X is
        not in FC_k;
    // prune all itemsets with closures already generated
6)  delete all itemsets X ∈ FCC_{k+1}.itemset where the closure of some
        k-subset of X contains X;
7)  Return ∪{X ∈ FCC_{k+1}};

The Close-FCI algorithm requires one database pass in each iteration. That
pass is needed in the Generate-Closures algorithm.
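The join and the two pruning steps of Algorithm 3 can be illustrated with a small Python sketch. It is written under assumptions: `fc_k` is a hypothetical dictionary mapping the itemset component of each element of $FC_k$ to its closure component, and the sample entries come from an invented five-transaction database (ABCDE three times, BCDE, ABE), not from the paper's Table 1.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def generate_candidates(fc_k: Dict[Itemset, Itemset]) -> List[Itemset]:
    """fc_k maps each k-itemset to its closure; returns the (k+1)-candidates."""
    sorted_sets = sorted(tuple(sorted(x)) for x in fc_k)
    joined: Set[Itemset] = set()
    # lines 1-4: join two k-itemsets that agree on their first k-1 items
    for a, b in combinations(sorted_sets, 2):
        if a[:-1] == b[:-1]:
            joined.add(frozenset(a) | frozenset(b))

    kept = []
    for cand in joined:
        k_subsets = [cand - {item} for item in cand]
        # line 5: every k-subset must itself be in FC_k
        if any(s not in fc_k for s in k_subsets):
            continue
        # line 6: skip cand if the closure of some k-subset already contains it
        if any(cand <= fc_k[s] for s in k_subsets):
            continue
        kept.append(cand)
    return sorted(kept, key=lambda s: tuple(sorted(s)))


# FC_1 for the hypothetical database: D and E are absent because their
# closures (BCDE and BE) coincide with closures that are already present.
fc_1 = {
    frozenset("A"): frozenset("ABE"),
    frozenset("B"): frozenset("BE"),
    frozenset("C"): frozenset("BCDE"),
}
print(["".join(sorted(c)) for c in generate_candidates(fc_1)])  # prints ['AC']
```

Here AB is dropped because closure(A) = ABE already contains it, and BC is dropped because closure(C) = BCDE contains it, so only AC needs a closure computation in the next database pass.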

Example 3. Applying Close-FCI to the database represented in Table 1 with
minimum support s = 3/5 = 0.6 and minimum confidence c = 0.75, the following
frequent closed itemsets are found:

FC = {ABCDE, BCDE, ABE, BE}

with support 3/5, 4/5, 4/5, and 5/5, respectively. In contrast, the Apriori
algorithm produces 31 different frequent itemsets (the nonempty subsets of
ABCDE) for the same values of s and c.
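The gap between 31 frequent itemsets and 4 frequent closed itemsets can be reproduced by brute force. The sketch below is an illustration only: Table 1 is not reproduced in this section, so `DB` is a hypothetical database constructed to be consistent with the supports reported in Example 3, and the helper names are invented.

```python
from itertools import combinations

# Hypothetical database consistent with the supports in Example 3.
DB = [frozenset(s) for s in ("ABCDE", "ABCDE", "ABCDE", "BCDE", "ABE")]
MIN_COUNT = 3  # minimum support s = 3/5 expressed as a transaction count


def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in DB if itemset <= t)


def closure(itemset):
    """Intersection of all transactions containing the itemset."""
    covering = [t for t in DB if itemset <= t]
    return frozenset.intersection(*covering)


items = sorted(set().union(*DB))
frequent = [
    frozenset(c)
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
    if support(frozenset(c)) >= MIN_COUNT
]
# an itemset is closed when it equals its own closure
closed = {x: support(x) for x in frequent if closure(x) == x}

print(len(frequent), len(closed))
for x in sorted(closed, key=len, reverse=True):
    print("".join(sorted(x)), closed[x])
```

Every frequent itemset has the same support as its closure, so the four closed itemsets carry all the support information of the 31 frequent ones.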

4.2  Generating  Representative  Association  Rules 

In this section, we present an algorithm for generating representative
association rules, which we call Generate-RAR. Generate-RAR takes as input the
set of all FCI and produces the set of all RAR. Generate-RAR is a modification
of FastGenAllRepresentatives given in [8], and it uses Properties 1-3 from
Section 2. It also uses the result that for an itemset $X$,
sup(X) = sup(closure(X)), which was mentioned in Section 4.1. The Generate-RAR
algorithm is given below.

Algorithm 4  Generate-RAR(all frequent closed itemsets FC)

1)  c ← minConfidence;   // user-specified value for minimum confidence
2)  RAR ← ∅;             // initialize set of RAR
3)  k ← 0;
    // split frequent closed itemsets according to size
4)  forall X ∈ FC do begin
5)      FC_|X| ← FC_|X| ∪ {X};
6)      if (k < |X|) then k ← |X|;
7)  end forall
8)  for (i ← k; i > 1; i--) do
9)      forall Z ∈ FC_i do begin
10)         maxSup = max({sup(Z') | Z ⊂ Z' ∈ FC} ∪ {0});
11)         if (Z.support ≠ maxSup) then begin   // see Property 3
12)             A_1 = {{Z[1]}, {Z[2]}, ..., {Z[i]}};   // create 1-antecedents
                // start loop2
13)             for (j = 1; (A_j ≠ ∅) and (j < i); j++) do begin
14)                 forall X ∈ A_j do begin
15)                     Y ← smallest closed itemset containing X;
16)                     X.support = Y.support;
17)                     // check if X ⇒ Z\X is a representative association rule
18)                     if (Z.support / X.support ≥ c and
                            maxSup / X.support < c) then
19)                         RAR ← RAR ∪ {X ⇒ Z\X};
20)                         A_j = A_j \ {X};
21)                     end if
22)                 end forall
23)                 A_{j+1} ← AprioriGen(A_j);
24)             end for   // end loop2
25)         end if
26)     end forall
27) Return RAR;

First, RAR is initialized to be empty. Then, each frequent closed itemset $X$
of length $i$ is added to $FC_i$, and the length $k$ of the maximal closed
itemsets is found. Line 8 controls the generation of representative association
rules. First, the largest rules (of size $k$) are generated and added to RAR.
Next, representative association rules of size $k - 1$ are generated and added
to RAR, and so on. Finally, representative association rules of size 2 are
generated and added to RAR. The generation of association rules of size $i$ is
controlled in lines 9 through 26 as follows:

Let $Z$ be a frequent closed itemset of size $i$. First, maxSup is found in
line 10. If there is no proper superset of $Z$ in $FC$, maxSup is assigned the
value zero. If maxSup has the same value as Z.support, then according to
Property 3, no representative association rule will be generated from $Z$.
Otherwise, the process of generating representative association rules from $Z$
starts. First, the set $A_1$ is assigned all possible 1-itemset antecedents of
$Z$. The loop in lines 13 through 24 controls the generation of representative
association rules with antecedents of length $j$. All possible antecedents
$X \in A_j$ are considered. The support of $X$ (which is equal to
support(closure(X))) is found. $X \Rightarrow Z \setminus X$ is a valid
representative association rule if its confidence is greater than or equal to
$c$ and if it satisfies the second condition of Property 2.i, in which case $X$
is removed from $A_j$. After all representative association rules with
antecedents of length $j$ have been generated from $Z$, $A_j$ may not be empty.
The AprioriGen function is called with argument $A_j$ to find antecedents of
length $j + 1$. The AprioriGen function is given in [2] and consists of lines 1
through 5 of Algorithm 3. Condition ii of Property 2 is satisfied by making
sure that no itemset in $A_{j+1}$ is a superset of the antecedent of an already
generated representative association rule, which is taken care of in line 20.
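The control flow described above can be sketched in Python. This is a compact illustration under assumptions, not the authors' implementation: `fc` is a hypothetical dictionary from frequent closed itemsets to supports given as transaction counts (integer counts keep the confidence ratios exact), `apriori_gen` is a simplified stand-in for AprioriGen, and the support of an antecedent is looked up as the largest support among its closed supersets, which is the support of its closure.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset]  # (antecedent, consequent)


def apriori_gen(level: Set[Itemset]) -> Set[Itemset]:
    """Join j-itemsets into (j+1)-itemsets whose j-subsets are all in level."""
    out = set()
    for a, b in combinations(level, 2):
        cand = a | b
        if len(cand) == len(a) + 1 and all(cand - {i} in level for i in cand):
            out.add(cand)
    return out


def generate_rar(fc: Dict[Itemset, int], min_conf: float) -> Set[Rule]:
    def sup(x: Itemset) -> int:
        # support(X) = support(closure(X)): the closure is the smallest closed
        # superset of X, hence the closed superset with the largest support
        return max(s for z, s in fc.items() if x <= z)

    rar: Set[Rule] = set()
    for z in sorted(fc, key=len, reverse=True):           # line 8: largest first
        if len(z) < 2:
            continue
        max_sup = max([s for y, s in fc.items() if z < y] + [0])   # line 10
        if fc[z] == max_sup:                              # line 11 (Property 3)
            continue
        level = {frozenset([i]) for i in z}               # line 12
        j = 1
        while level and j < len(z):                       # line 13
            for x in list(level):                         # line 14
                sx = sup(x)                               # lines 15-16
                if fc[z] / sx >= min_conf and max_sup / sx < min_conf:  # line 18
                    rar.add((x, z - x))                   # line 19
                    level.discard(x)                      # line 20
            level = apriori_gen(level)                    # line 23
            j += 1
    return rar


FC = {frozenset("ABCDE"): 3, frozenset("BCDE"): 4,
      frozenset("ABE"): 4, frozenset("BE"): 5}
for x, y in sorted(generate_rar(FC, 0.75), key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print("".join(sorted(x)), "=>", "".join(sorted(y)))
```

With the supports of Example 3 and c = 0.75, this sketch yields seven rules: A => BCDE, C => ABDE, D => ABCE, B => CDE, E => BCD, B => AE and E => AB.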

Example 4. Running the Generate-RAR algorithm using as input the frequent
closed itemsets produced in the previous example, the following representative
association rules are generated:

\[ RAR = \{A \Rightarrow BCDE,\ C \Rightarrow ABDE,\ D \Rightarrow ABCE,\ B \Rightarrow CDE,\ E \Rightarrow BCD,\ B \Rightarrow AE,\ E \Rightarrow AB\}. \]

They are the same rules as those generated by FastGenAllRepresentatives in [8].

5  Conclusion 

This paper presents a new approach for generating representative association
rules using frequent closed itemsets, which constitute only a subset of the set
of all frequent itemsets. Our approach results in a substantial reduction in the
input size and thus in faster algorithms for generating representative
association rules. We also presented a new algorithm, called Generate-RAR, for
generating representative association rules.

Traditional association mining algorithms first generate all frequent itemsets
and then use them to generate all the association rules. For future work, we
will investigate whether FCI can be used directly to generate all the
association rules. We will also investigate whether our approach for finding RAR
using FCI can result in faster generation of all the association rules than the
traditional algorithms.
Abstract. A notion of legitimate definitions of support and confidence under
incompleteness is defined. Properties of generic legitimate definitions of
support and confidence are investigated. We show that in the case of
incompleteness legitimate association rules can be derived from legitimate
representative rules by the cover operator. It is proved that the minimum
condition maximum consequence association rules under incompleteness constitute
a subset of representative rules of the same type. Algorithms for generating
association rules under incompleteness are offered.


1  Introduction 

The problem of association rule discovery was introduced in [1] for sales
transaction databases. The problem of data incompleteness does not occur for a
transaction database; however, it is often unavoidable in relational databases.
Missing data may result from errors, measurement failures, changes in the
database schema, etc.

Incompleteness of the data in the database raises the question of how to treat
requests for a given user-specified support and confidence of rules. We cannot
evaluate exact values of support and confidence. Instead, we can evaluate the
optimistic and pessimistic support and confidence of a rule. For marginal
incompleteness we could expect the difference between the optimistic and
pessimistic parameters to be insignificant. If, however, the incompleteness
grows, the gap between the optimistic and pessimistic parameters will also grow.
We could therefore foresee that the user is also interested in "expected"
support and confidence.

Data incompleteness in the context of association rules was addressed in [5],
which showed how to compute pessimistic and optimistic estimations of the
support and confidence of an association rule. In [6] a set of properties that
characterize a legitimate approach to incompleteness was proposed. An example
of a legitimate probabilistic approach was presented in [6], as well as examples
of popular approaches ignoring missing values that turn out not to be
legitimate.

In this paper, we investigate properties of generic legitimate definitions of
support and confidence in the case of incomplete databases. We will show that in
the case of incompleteness legitimate association rules can be derived from
legitimate representative rules by the cover operator [3]. We also prove that
the minimum condition maximum consequence association rules of any type
constitute a subset of representative rules of the same type. Finally, we offer
algorithms for generating association rules from incomplete databases.
2  Association Rules in Complete Relational Databases

Let us consider a table $D = (O, AT)$, where $O$ is a non-empty finite set of
tuples and $AT$ is a non-empty finite set of attributes, such that
$a\colon O \to V_a$ for any $a \in AT$, where $V_a$ denotes the domain of $a$.
Any attribute-value pair $(a, v)$, where $a \in AT$ and $v \in V_a$, will be
called an item. A set of items will be called an itemset. The support of an
itemset $X$ is denoted by sup(X) and defined as the number (or the percentage)
of tuples in $D$ that contain $X$.

An association rule is an expression of the form $X \Rightarrow Y$, where $X$
and $Y$ are itemsets and $X \cap Y = \emptyset$. The support of a rule
$X \Rightarrow Y$ is denoted by $sup(X \Rightarrow Y)$ and is defined as
$sup(X \cup Y)$. The confidence of the rule $X \Rightarrow Y$ is denoted by
$conf(X \Rightarrow Y)$ and defined as $sup(X \cup Y) / sup(X)$. Usually, one is
interested in discovering rules that have support greater than a specified
minimum support $s$ and confidence not less than a user-specified minimum
confidence $c$. The set of such rules will be denoted by $AR(s,c)$, i.e.
$AR(s,c) = \{r \mid sup(r) > s \wedge conf(r) \geq c\}$. If $s$ and $c$ are
understood, we will write briefly AR.

Usually ARs are too numerous for practical use. One of the most popular methods
of restricting the number of rules is to generate only those with the minimum
condition part. Those rules are particularly useful in classification
procedures. It was shown in [9] that for any rule with a minimum antecedent one
can reduce its consequent without any loss of support or confidence; on the
contrary, this may even lead to a rule with better parameters. This observation
justifies generating the rules with minimum antecedents and maximum consequents
[10]. The set of minimum condition maximum consequence association rules wrt.
support $s$ and confidence $c$, $MMR(s,c)$, is defined as follows:

\[ MMR(s,c) = \{ r\colon (X \Rightarrow Y) \in AR(s,c) \mid \neg\exists r'\colon (X' \Rightarrow Y') \in AR(s,c),\ r' \neq r \wedge X' \subseteq X \wedge Y' \supseteq Y \}. \]

Recently another way of reducing the size of the set of generated rules was
proposed in [3]. Let us recall the notions of representative association rules
and the cover operator. Informally speaking, a set of all representative
association rules is a least set of rules that covers all association rules by
means of a cover operator. The cover $C$ of the rule $X \Rightarrow Y$,
$Y \neq \emptyset$, is defined as follows:

\[ C(X \Rightarrow Y) = \{ X \cup Z \Rightarrow V \mid Z, V \subseteq Y \wedge Z \cap V = \emptyset \wedge V \neq \emptyset \}. \]

Each rule in $C(X \Rightarrow Y)$ consists of a subset of the items occurring in
the rule $X \Rightarrow Y$. The antecedent of any rule $r$ covered by
$X \Rightarrow Y$ contains $X$ and perhaps some items in $Y$, whereas the
consequent of $r$ is a non-empty subset of the remaining items in $Y$. For the
cover $C$ the following property holds:

Property 1. Let $r\colon (X \Rightarrow Y)$ and $r'\colon (X' \Rightarrow Y')$
be association rules. Then:

\[ r' \in C(r) \iff X' \cup Y' \subseteq X \cup Y \wedge X' \supseteq X \iff X' \cup Y' \subseteq X \cup Y \wedge X' \supseteq X \wedge Y' \subseteq Y. \]
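The cover operator can be enumerated directly from its definition. The following Python sketch is illustrative only; the names `subsets` and `cover` and the single-letter items are hypothetical, and rules are represented as (antecedent, consequent) pairs of frozensets.

```python
from itertools import combinations
from typing import FrozenSet, Iterator, Set, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset]  # (antecedent, consequent)


def subsets(s: Itemset) -> Iterator[Itemset]:
    """All subsets of s, including the empty set and s itself."""
    items = sorted(s)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)


def cover(x: Itemset, y: Itemset) -> Set[Rule]:
    """C(X => Y) = {X u Z => V | Z, V subsets of Y, Z and V disjoint, V non-empty}."""
    return {
        (x | z, v)
        for v in subsets(y) if v          # V must be non-empty
        for z in subsets(y - v)           # Z is disjoint from V by construction
    }


rules = cover(frozenset("A"), frozenset("BC"))
for ant, cons in sorted(rules, key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print("".join(sorted(ant)), "=>", "".join(sorted(cons)))
```

For a consequent of m items there are 3^m disjoint pairs (Z, V) and 2^m of them have an empty V, so the cover contains 3^m - 2^m rules, the covering rule itself included; for A => BC this gives 5 rules.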

As proved in [3], for an association rule $r$ having support $s$ and confidence
$c$, each rule in the cover $C(r)$ belongs to $AR(s,c)$. Hence, if
$r \in AR(s,c)$, then every rule in $C(r)$ also belongs to $AR(s,c)$. This
property can be used to look for a base of rules covering all others. In the
sequel, such a minimal base of rules will be called a set of representative
association rules and will be denoted by RR. Formally, the set of representative
association rules wrt. support $s$ and confidence $c$ is defined as follows:

\[ RR(s,c) = \{ r \in AR(s,c) \mid \neg\exists r' \in AR(s,c),\ r' \neq r \wedge r \in C(r') \}. \]

Each rule in RR is called a representative association rule. From the definition
of RR it follows that no representative association rule belongs to the cover
of another association rule and that

\[ AR(s,c) = \bigcup_{r \in RR(s,c)} C(r). \]

It was proved in [4] that $MMR \subseteq RR$ and that MMR can be derived from RR.


3  Support  and  Confidence  in  Incomplete  Databases 


Computation of the real support of itemsets and the real confidence of
association rules is not feasible in the case of databases with missing
attribute values. However, one can always calculate the least and greatest
possible values of support and confidence, as well as try to predict their
"expected" values. In order to express the properties of data incompleteness we
will apply the following notions:

•  Missing values will be denoted by *,

•  The maximal set of tuples that certainly match an itemset $X$ is denoted by
$n(X)$ and is defined as follows:
$n(X) = \{t \in D \mid \forall (a,v) \in X\colon a(t) = v\}$,

•  By $m(X)$ we denote the maximal set of tuples that possibly match the itemset
$X$ in $D$, i.e. $m(X) = \{t \in D \mid \forall (a,v) \in X\colon a(t) \in \{v, *\}\}$,

•  The set-theoretical difference $m(X) \setminus n(X)$ is denoted by $d(X)$,

•  By $n(-X)$ we denote the maximal set of tuples that certainly do not match
the itemset $X$ in $D$, i.e. $n(-X) = D \setminus m(X)$,

•  By $m(-X)$ we denote the maximal set of tuples that possibly do not match the
itemset $X$ in $D$, i.e. $m(-X) = D \setminus n(X)$.

Example 1. Given the incomplete database D presented in Fig. 1, Fig. 2
illustrates the notions of certainly and possibly matching an itemset.

Fig. 1. Example incomplete database

Fig. 2. Tuples matching itemset X


Let the least possible support of an itemset $X$ be denoted by $pSup(X)$
(pessimistic case) and the greatest possible support be denoted by $oSup(X)$
(optimistic case). Clearly,

\[ pSup(X) = |n(X)| \quad \text{and} \quad oSup(X) = |m(X)|. \]

Let $X \subset Y$. It is easy to observe that $pSup(X) \geq pSup(Y)$ and
$oSup(X) \geq oSup(Y)$.
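The sets n(X), m(X) and d(X) and the two support bounds can be computed in a few lines of Python. The sketch below is an illustration under assumptions: the table `D` is hypothetical (the paper's Fig. 1 is not reproduced here), tuples are identified by their list index, and supports are absolute counts.

```python
from typing import Dict, List, Set, Tuple

MISSING = "*"
Item = Tuple[str, str]  # (attribute, value)

# Hypothetical incomplete table with attributes a and b.
D: List[Dict[str, str]] = [
    {"a": "1", "b": "x"},
    {"a": "1", "b": "*"},
    {"a": "*", "b": "x"},
    {"a": "2", "b": "x"},
    {"a": "1", "b": "y"},
]


def n(X: Set[Item]) -> Set[int]:
    """Tuples that certainly match X: every item value is present and equal."""
    return {i for i, t in enumerate(D) if all(t[a] == v for a, v in X)}


def m(X: Set[Item]) -> Set[int]:
    """Tuples that possibly match X: every item value is equal or missing."""
    return {i for i, t in enumerate(D) if all(t[a] in (v, MISSING) for a, v in X)}


def p_sup(X: Set[Item]) -> int:
    return len(n(X))


def o_sup(X: Set[Item]) -> int:
    return len(m(X))


X = {("a", "1"), ("b", "x")}
print(sorted(n(X)), sorted(m(X)), sorted(m(X) - n(X)))  # n(X), m(X), d(X)
print(p_sup(X), o_sup(X))                               # pSup(X), oSup(X)
print(p_sup({("a", "1")}), o_sup({("a", "1")}))         # bounds for a subset of X
```

On this table the itemset {(a,1), (b,x)} is certainly matched by one tuple and possibly matched by three, and the bounds for its subset {(a,1)} are at least as large, as the monotonicity observation requires.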

Let $pConf(X \Rightarrow Y)$ and $oConf(X \Rightarrow Y)$ denote the least
possible confidence and the greatest possible confidence of $X \Rightarrow Y$,
respectively. These values can be computed according to the following equations
(see [5] for proof):

\[ pConf(X \Rightarrow Y) = \frac{|n(X) \cap n(Y)|}{|n(X) \cap n(Y)| + |m(X) \cap m(-Y)|}, \tag{1} \]

\[ oConf(X \Rightarrow Y) = \frac{|m(X) \cap m(Y)|}{|m(X) \cap m(Y)| + |n(X) \cap n(-Y)|}. \tag{2} \]
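Equations (1) and (2) translate directly into set operations. The Python sketch below is an illustration only: the table `D` is hypothetical, the helper names are invented, and the code assumes that the denominators are non-zero.

```python
from typing import Dict, List, Set, Tuple

MISSING = "*"
Item = Tuple[str, str]  # (attribute, value)

# Hypothetical incomplete table with attributes a and b.
D: List[Dict[str, str]] = [
    {"a": "1", "b": "x"},
    {"a": "1", "b": "*"},
    {"a": "*", "b": "x"},
    {"a": "2", "b": "x"},
    {"a": "1", "b": "y"},
]
ALL = set(range(len(D)))


def n(X: Set[Item]) -> Set[int]:
    return {i for i, t in enumerate(D) if all(t[a] == v for a, v in X)}


def m(X: Set[Item]) -> Set[int]:
    return {i for i, t in enumerate(D) if all(t[a] in (v, MISSING) for a, v in X)}


def n_not(Y: Set[Item]) -> Set[int]:
    return ALL - m(Y)        # n(-Y): certainly do not match Y


def m_not(Y: Set[Item]) -> Set[int]:
    return ALL - n(Y)        # m(-Y): possibly do not match Y


def p_conf(X: Set[Item], Y: Set[Item]) -> float:
    hits = len(n(X) & n(Y))              # certainly match both X and Y
    misses = len(m(X) & m_not(Y))        # possibly match X and possibly miss Y
    return hits / (hits + misses)        # equation (1); assumes hits + misses > 0


def o_conf(X: Set[Item], Y: Set[Item]) -> float:
    hits = len(m(X) & m(Y))              # possibly match both X and Y
    misses = len(n(X) & n_not(Y))        # certainly match X and certainly miss Y
    return hits / (hits + misses)        # equation (2); assumes hits + misses > 0


X, Y = {("a", "1")}, {("b", "x")}
print(p_conf(X, Y), o_conf(X, Y))
```

For the rule (a,1) => (b,x) on this table the pessimistic confidence is 1/3 and the optimistic confidence is 3/4, which shows how far apart the two bounds can be even when only two values are missing.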
The differences between optimistic and pessimistic estimations for rules can be
high. It would be desirable to have a method of predicting support and
confidence close to the real (though unknown) values. Several definitions of
expected support and expected confidence have been proposed in the literature.
One can argue which definition is better or when it should be applied. Whatever
the definition of expected support and confidence in an incomplete database, we
do not accept it if any one of the postulates below is not satisfied:

Postulates. Let $X$ and $Y$ be itemsets, $iSup(X)$ denote support under
incompleteness, $iConf(X \Rightarrow Y)$ denote confidence under incompleteness,
$A \subseteq AT$, and $Instances(A)$ denote the set of all possible tuples over
the set of attributes $A$.

\[ iSup(X) \in [pSup(X), oSup(X)], \tag{P1} \]

\[ iSup(X) \geq iSup(Y) \text{ for } X \subset Y, \tag{P2} \]

\[ iConf(X \Rightarrow Y) = iSup(X \cup Y) / iSup(X), \tag{P3} \]

\[ iConf(X \Rightarrow Y) \in [pConf(X \Rightarrow Y), oConf(X \Rightarrow Y)], \tag{P4} \]

\[ \sum_{X \in Instances(A)} iSup(X) = 1 \text{ for any } A \subseteq AT. \tag{P5} \]


The definitions of support under incompleteness and confidence under
incompleteness that satisfy all postulates P1-P5 will be called legitimate.
Below we show an example of a legitimate approach to incompleteness (see [6]
for proof).

Example 2. Let $\mu\colon AT \times V \to [0,1]$ denote the frequency with which
the value $v \in V_a$ occurs for the attribute $a \in AT$ in $D$, defined as
$\mu(a,v) = |n(a,v)| / |D \setminus d(a,v)|$. Based on the notion of $\mu$, the
probability $probSup_t$ of supporting an item $(a,v)$ by the tuple $t \in D$ is
defined:

\[ probSup_t(a,v) = \begin{cases} 1 & \text{if } a(t) = v \\ \mu(a,v) & \text{if } a(t) = * \\ 0 & \text{otherwise.} \end{cases} \]

The probability $probSup_t$ of supporting an itemset
$X = \{(a_1,v_1), \ldots, (a_k,v_k)\}$ by a tuple $t \in D$ is defined as
follows:

\[ probSup_t(X) = probSup_t(a_1,v_1) * \ldots * probSup_t(a_k,v_k). \]

Probable support ($probSup$) of an itemset $X$ in the database $D$ is defined
below:

\[ probSup(X) = \sum_{t \in D} probSup_t(X). \]

Probable confidence ($probConf$) of a rule $X \Rightarrow Y$ is defined in the
usual way:

\[ probConf(X \Rightarrow Y) = probSup(X \cup Y) / probSup(X). \]
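The definitions of Example 2 can be traced on a small table with the following Python sketch. It is an illustration under assumptions: the table `D` is hypothetical, the function names are invented, and probable support is computed as the plain sum over tuples, exactly as the formula above is written, so it is an absolute count.

```python
from typing import Dict, List, Set, Tuple

MISSING = "*"
Item = Tuple[str, str]  # (attribute, value)

# Hypothetical incomplete table with attributes a and b.
D: List[Dict[str, str]] = [
    {"a": "1", "b": "x"},
    {"a": "1", "b": "*"},
    {"a": "*", "b": "x"},
    {"a": "2", "b": "x"},
    {"a": "1", "b": "y"},
]


def mu(a: str, v: str) -> float:
    """Frequency of value v among the tuples whose value of a is known."""
    known = [t for t in D if t[a] != MISSING]
    return sum(1 for t in known if t[a] == v) / len(known)


def prob_sup_t(t: Dict[str, str], X: Set[Item]) -> float:
    """Probability that tuple t supports every item of X."""
    p = 1.0
    for a, v in X:
        if t[a] == v:
            continue                 # factor 1: the value certainly matches
        if t[a] == MISSING:
            p *= mu(a, v)            # factor mu(a, v): the value is missing
        else:
            return 0.0               # factor 0: a different value is present
    return p


def prob_sup(X: Set[Item]) -> float:
    return sum(prob_sup_t(t, X) for t in D)


def prob_conf(X: Set[Item], Y: Set[Item]) -> float:
    return prob_sup(X | Y) / prob_sup(X)


X, Y = {("a", "1")}, {("b", "x")}
print(mu("a", "1"), prob_sup(X | Y), prob_sup(X), prob_conf(X, Y))
```

On this table probSup of {(a,1), (b,x)} is 2.5, which lies between its pessimistic support 1 and optimistic support 3, and the probable confidence of (a,1) => (b,x) is 2/3, between the bounds 1/3 and 3/4, in line with postulates P1 and P4.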

Incompleteness of the data raises the question of how to treat requests for a
given user-specified support and confidence of rules. One can imagine that a
user is interested in the rules whose pessimistic support and expected
confidence are above the requested thresholds, or in the rules whose pessimistic
support and pessimistic confidence are above the thresholds. Other variations of
user requirements are also likely. To this end we proposed in [7] a generic
definition of types of association rules:

\[ AR_{\alpha\beta}(s,c) = \{ r \mid \alpha Sup(r) > s \wedge \beta Conf(r) \geq c \}, \]

where $\alpha$ / $\beta$ can be substituted by either p (pessimistic) or o
(optimistic) support / confidence. For instance,
$AR_{po}(s,c) = \{r \mid pSup(r) > s \wedge oConf(r) \geq c\}$.

In the sequel, we investigate properties of generic legitimate definitions of
support and confidence under incompleteness; therefore, we presume that
$\alpha$ / $\beta$ can also be substituted by i (legitimate support / confidence
under incompleteness). Obviously, the most natural combinations of requests for
support and confidence of rules are: $AR_{pp}(s,c)$, $AR_{oo}(s,c)$,
$AR_{ii}(s,c)$. Please note that for a complete database all these definitions
are equivalent.

Analogously to RR for a complete database, we introduced the notion of
representative rules for incomplete databases in [7]. Here we apply the same
generic notation as we did for $AR_{\alpha\beta}$ above. The set of
representative association rules wrt. minimum support $s$ and minimum confidence
$c$ is denoted by $RR_{\alpha\beta}(s,c)$ and defined as follows:

\[ RR_{\alpha\beta}(s,c) = \{ r \in AR_{\alpha\beta}(s,c) \mid \neg\exists r' \in AR_{\alpha\beta}(s,c),\ r' \neq r \wedge r \in C(r') \}. \]

Property 2 [7]. Let $r$, $r'$ be rules and $r' \in C(r)$. Then:

a)  $pSup(r') \geq pSup(r)$ and $oSup(r') \geq oSup(r)$,

b)  $pConf(r') \geq pConf(r)$ and $oConf(r') \geq oConf(r)$.

By analogy we introduce here the same generic $\alpha\beta$ notation for MMR as
we did for AR and RR. The set of minimum condition maximum consequence
association rules wrt. support $s$ and confidence $c$ will be denoted by
$MMR_{\alpha\beta}(s,c)$ and defined as follows:

\[ MMR_{\alpha\beta}(s,c) = \{ r\colon (X \Rightarrow Y) \in AR_{\alpha\beta}(s,c) \mid \neg\exists r'\colon (X' \Rightarrow Y') \in AR_{\alpha\beta}(s,c),\ r' \neq r \wedge X' \subseteq X \wedge Y' \supseteq Y \}. \]

In the next sections we will investigate the relationships among legitimate
$AR_{\alpha\beta}$, $RR_{\alpha\beta}$ and $MMR_{\alpha\beta}$ in the case of an
incomplete database.


4  Legitimate  Representative  Rules 

In this section we will show that legitimate $AR_{\alpha\beta}(s,c)$ can be
derived syntactically from legitimate $RR_{\alpha\beta}(s,c)$. Let us start by
examining properties of rules related by the cover operator.

Property 3. Let $r$, $r'$ be some rules and $r' \in C(r)$. Then:

a)  $iSup(r') \geq iSup(r)$,

b)  $iConf(r') \geq iConf(r)$.

Proof: Let $r\colon X \Rightarrow Y$, $r'\colon X' \Rightarrow Y'$ and
$r' \in C(r)$.

Ad. a) Follows immediately from Property 1 and Postulate P2.

Ad. b) $iConf(X' \Rightarrow Y') = iSup(X' \cup Y') / iSup(X')$ and
$iConf(X \Rightarrow Y) = iSup(X \cup Y) / iSup(X)$. It follows from Property 1
and Postulate P2 that $iSup(X' \cup Y') \geq iSup(X \cup Y)$ and
$iSup(X') \leq iSup(X)$. Hence,
$iConf(X' \Rightarrow Y') \geq iConf(X \Rightarrow Y)$. □

Properties 2-3 allow us to conclude with the following property:

Property 4. Let $r$, $r'$ be rules and $r' \in C(r)$. If
$r \in AR_{\alpha\beta}(s,c)$, then $r' \in AR_{\alpha\beta}(s,c)$.

By the definition of $RR_{\alpha\beta}$ and from Property 4 one can infer that
all association rules of any $\alpha\beta$ type can be derived by means of the
cover operator from representative rules of the same type, which is stated by
Property 5.

Property 5. $AR_{\alpha\beta}(s,c) = \bigcup_{r \in RR_{\alpha\beta}(s,c)} C(r)$.

5  Legitimate  Minimum  Condition  Maximum  Consequence  Rules 

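In Section 5, we investigate how incompleteness influences the relationship
between RR and MMR. In particular, we prove that the minimum condition maximum
consequence rules of $\alpha\beta$ type constitute a subset of representative
rules of the same type.

Property 6. $MMR_{\alpha\beta}(s,c) \subseteq RR_{\alpha\beta}(s,c)$.

Proof: We will write shortly $RR_{\alpha\beta}$ and $MMR_{\alpha\beta}$ instead
of $RR_{\alpha\beta}(s,c)$ and $MMR_{\alpha\beta}(s,c)$, respectively.
Property 1 allows us to express $RR_{\alpha\beta}$ as follows:

\[ RR_{\alpha\beta} = \{ r\colon (X \Rightarrow Y) \in AR_{\alpha\beta} \mid \neg\exists r'\colon (X' \Rightarrow Y') \in AR_{\alpha\beta},\ r' \neq r \wedge X \cup Y \subseteq X' \cup Y' \wedge X' \subseteq X \wedge Y' \supseteq Y \}. \]

On the other hand, $MMR_{\alpha\beta}$ is defined as follows:

\[ MMR_{\alpha\beta} = \{ r\colon (X \Rightarrow Y) \in AR_{\alpha\beta} \mid \neg\exists r'\colon (X' \Rightarrow Y') \in AR_{\alpha\beta},\ r' \neq r \wedge X' \subseteq X \wedge Y' \supseteq Y \}. \]

Using the two formulae above one can easily observe that each rule in
$MMR_{\alpha\beta}$ also belongs to $RR_{\alpha\beta}$, but not necessarily vice
versa. □

Next, we prove that the minimum condition maximum consequence rules of any
$\alpha\beta$ type can be extracted from representative rules of the same type.

Property 7.

\[ MMR_{\alpha\beta}(s,c) = \{ r\colon (X \Rightarrow Y) \in RR_{\alpha\beta}(s,c) \mid \neg\exists r'\colon (X' \Rightarrow Y') \in RR_{\alpha\beta}(s,c),\ r' \neq r \wedge X' \subseteq X \wedge Y' \supseteq Y \}. \]

Proof: We will write shortly $RR_{\alpha\beta}$ and $MMR_{\alpha\beta}$ instead
of $RR_{\alpha\beta}(s,c)$ and $MMR_{\alpha\beta}(s,c)$, respectively. By
Property 6, $MMR_{\alpha\beta}$ is contained in $RR_{\alpha\beta}$. So,
$MMR_{\alpha\beta} = \{ r\colon (X \Rightarrow Y) \in RR_{\alpha\beta} \mid \neg\exists r'\colon (X' \Rightarrow Y') \in AR_{\alpha\beta},\ r' \neq r \wedge X' \subseteq X \wedge Y' \supseteq Y \}$.
Now, we have only to prove that for any
$r\colon (X \Rightarrow Y) \in AR_{\alpha\beta}$, the expression (3) is
equivalent to (4).

\[ \exists r'\colon (X' \Rightarrow Y') \in AR_{\alpha\beta},\ r' \neq r \wedge X' \subseteq X \wedge Y' \supseteq Y \tag{3} \]

\[ \exists r''\colon (X'' \Rightarrow Y'') \in RR_{\alpha\beta},\ r'' \neq r \wedge X'' \subseteq X \wedge Y'' \supseteq Y. \tag{4} \]

Let $r'\colon (X' \Rightarrow Y') \in AR_{\alpha\beta}$ and $r' \neq r$ and
$X' \subseteq X$ and $Y' \supseteq Y$. Each association rule belongs to the
cover of some representative rule, so there is some
$r''\colon (X'' \Rightarrow Y'')$ in $RR_{\alpha\beta}$ such that
$r' \in C(r'')$. Hence, $X'' \subseteq X' \subseteq X$ and
$Y'' \supseteq Y' \supseteq Y$ and thus (3) implies (4). The inverse implication
is trivial (any representative rule is an association rule). □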


6  The  Algorithms 

In this section, we present algorithms for generating $AR_{pp}$, $AR_{oo}$ and
$AR_{ii}$. Other $AR_{\alpha\beta}$ combinations can be computed similarly. The
problem of generating association rules is usually decomposed into two
subproblems:

1.  Generate all itemsets whose support exceeds the minimum support minSup.
Itemsets with this property are called frequent (large).

2.  Generate association rules from frequent itemsets. Let $X$ be a frequent
itemset and $\emptyset \neq Y \subset X$. Then the rule
$X \setminus Y \Rightarrow Y$ holds if
$sup(X) / sup(X \setminus Y) \geq minConf$.

Step (1) of our algorithms is based on the algorithm in [8], in that we use
lists of transaction identifiers. Step (2) is performed with the ap-genrules
algorithm [2], except for computing confidence, which we show below. In what
follows, we use the following auxiliary notions: a k-itemset is an itemset of
size k, and a regular itemset is an itemset without missing values.


6.1  Generation  of  Frequent  Itemsets  under  Incompleteness 

First, let us briefly recall the main idea of the gen_large_itemsets algorithm
(see Fig. 3) for computing frequent itemsets [8]. Then, we will offer
modifications of this algorithm that allow computing frequent itemsets for a
given threshold on three kinds of support: pessimistic, optimistic, and under
incompleteness as defined in Example 2. The following notation is used in the
gen_large_itemsets function:

•  $F_k$ - set of frequent k-itemsets;

•  $f[1] \bullet f[2] \bullet \ldots \bullet f[k]$ - k-itemset consisting of
items f[1], f[2], ..., f[k];

•  tIdList - a list of transaction identifiers;

Associated with each itemset is a support field to store the support of this
itemset and a tIdList to store the identifiers of the transactions containing
the itemset.

function gen_large_itemsets(D: database);

1)  compute the family F_1 of frequent 1-itemsets and their tIdLists;
2)  for (k = 2; F_{k-1} ≠ ∅; k++) do {
3)    forall f_1 ∈ F_{k-1} do {
4)      forall f_2 ∈ F_{k-1} do {
5)        if f_1[1]=f_2[1] ∧ ... ∧ f_1[k-2]=f_2[k-2] ∧ f_1[k-1]<f_2[k-1] then {
6)          c = f_1[1] • f_1[2] • ... • f_1[k-1] • f_2[k-1];
7)          if c has k subsets in F_{k-1} then {
8)            c.tIdList = f_1.tIdList ∩ f_2.tIdList;
9)            c.support = |c.tIdList|;
10)           if c.support > minSup then F_k = F_k ∪ {c}; }}}}}
11) return ∪_k F_k;


Fig.  3.  Function  gen_large_itemsets 

The gen_large_itemsets function reads the database only once in order to create
lists of transaction identifiers for each item occurring in the database. $F_1$
is assigned those 1-itemsets that have support not less than minSup. Next, each
k-th iteration, $k \geq 2$, generates candidate k-itemsets from the pairs of
frequent (k-1)-itemsets (see [8] for details). To avoid unnecessary computations
of tIdLists, the candidate k-itemsets that do not have some subset in the
frequent (k-1)-itemsets are pruned. The tIdList for each remaining candidate c
is computed as the intersection of the tIdLists of the (k-1)-itemsets that were
used for constructing c. Its length determines the support of c. The k-itemsets
with support greater than minSup are included in $F_k$.
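The tIdList-based computation can be sketched in Python as follows. This is an illustration under assumptions rather than the function of [8]: itemsets are sorted tuples, tIdLists are Python sets of transaction indices, the database `DB` is hypothetical, and the sketch keeps itemsets whose support is not less than `min_sup` (the text uses both "greater than" and "not less than", so the threshold convention here is a choice).

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Itemset = Tuple[str, ...]  # items kept in sorted order


def gen_large_itemsets(db: List[Set[str]], min_sup: int) -> Dict[Itemset, Set[int]]:
    """Return {frequent itemset: tIdList}; the support is the tIdList length."""
    # line 1: the only database pass builds one tIdList per item
    tids: Dict[Itemset, Set[int]] = {}
    for tid, transaction in enumerate(db):
        for item in transaction:
            tids.setdefault((item,), set()).add(tid)
    level = {f: ids for f, ids in tids.items() if len(ids) >= min_sup}
    frequent = dict(level)

    while level:                                            # line 2
        next_level: Dict[Itemset, Set[int]] = {}
        for f1, f2 in combinations(sorted(level), 2):       # lines 3-5
            if f1[:-1] != f2[:-1]:
                continue
            c = f1 + (f2[-1],)                              # line 6
            # line 7: all k subsets of size k-1 must be frequent
            if any(c[:i] + c[i + 1:] not in level for i in range(len(c))):
                continue
            tid_list = level[f1] & level[f2]                # line 8
            if len(tid_list) >= min_sup:                    # lines 9-10
                next_level[c] = tid_list
        frequent.update(next_level)
        level = next_level
    return frequent


# Hypothetical complete database, used only for illustration.
DB = [set("ABCDE"), set("ABCDE"), set("ABCDE"), set("BCDE"), set("ABE")]
F = gen_large_itemsets(DB, 3)
print(len(F), sorted(F[("B", "E")]), sorted(F[("A", "C")]))
```

After the first pass the database is never read again: every support is obtained by intersecting two tIdLists from the previous level.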


512  M.  Kryszkiewicz  and  H.  Rybinski 


Now, we will present the necessary modifications of gen_large_itemsets for
computing frequent itemsets in the case when the database is incomplete.

The Generate-P-FrequentItemsets Function

It computes all frequent itemsets X such that pSup(X) > minSup. It differs from
gen_large_itemsets only in line 1, which initializes the 1-itemsets. The code
corresponding to line 1 of gen_large_itemsets is as follows:

compute frequent regular 1-itemsets F_1 and their tIdLists, where tIdList for
each c in F_1 is the list of identifiers of transactions that certainly contain c.


The Generate-O-FrequentItemsets Function

It computes all frequent itemsets X such that oSup(X) > minSup. It differs from
gen_large_itemsets only in line 1, which initializes the 1-itemsets. The code
corresponding to line 1 of gen_large_itemsets is as follows:

compute frequent regular 1-itemsets F_1 and their tIdLists, where tIdList for
each c in F_1 is the list of identifiers of transactions that possibly contain c.


The Generate-I-FrequentItemsets Function

It computes all frequent itemsets X such that probSup(X) > minSup. The function uses
a modified tIdList structure:

Let c be a candidate k-itemset. The elements of c.tIdList will be pairs (tId, iVec),
where tId is the transaction identifier and iVec is a Boolean vector of size k that, for
each item c[j] ∈ c, j = 1, ..., k, indicates whether the transaction identified by tId contains the
item certainly. If so, iVec[j] is assigned 1; otherwise it is equal to 0. In addition,
1-itemsets have a component p that stores information on frequencies of the items.

The function differs from gen_large_itemsets in line 1 (which initializes 1-itemsets)
and in lines 8-9 (which compute tIdList and support, respectively). The code
corresponding to line 1 of gen_large_itemsets is as follows:

compute frequent regular 1-itemsets as well as their frequency p and tIdLists
keeping information on transactions that possibly contain c;

The code corresponding to lines 8-9 should be as follows:

forall t1 ∈ f1.tIdList do
    forall t2 ∈ f2.tIdList do
        if t1.tId = t2.tId then
            add (t1.tId, t1.iVec • t2.iVec[k-1]) to c.tIdList;
c.support = 0;
forall t ∈ c.tIdList do {
    probSup_t = 1;
    for j=1 to k do {
        probSup_t = probSup_t * max(iVec[j], p(c[j])); };
    c.support = c.support + probSup_t; };

The expression max(iVec[j], p(c[j])) returns 1 if the j-th item of c is contained by
the respective transaction certainly; otherwise it returns the frequency of this item.
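
The modified lines 8-9 can be illustrated by a minimal Python sketch. It assumes that each tIdList is a list of (tId, iVec) pairs with iVec stored as a tuple of 0/1 flags, and that p is a dictionary from items to their frequencies; the names join_tid_lists and prob_support and the toy data are hypothetical and do not come from the paper.

```python
from math import prod

def join_tid_lists(f1_list, f2_list):
    """Join the tIdLists of the two (k-1)-itemsets used to build candidate c.

    Each list holds (tId, iVec) pairs; the joined iVec is f1's vector
    extended with the last flag of f2's vector.
    """
    f2_by_tid = dict(f2_list)
    return [(tid, ivec + (f2_by_tid[tid][-1],))
            for tid, ivec in f1_list if tid in f2_by_tid]

def prob_support(candidate, tid_list, p):
    """Sum over transactions of the product of max(iVec[j], p[c[j]])."""
    return sum(prod(max(flag, p[item]) for flag, item in zip(ivec, candidate))
               for _tid, ivec in tid_list)

# Hypothetical toy data: tIdLists of the 2-itemsets {a, b} and {a, c}.
ab = [(1, (1, 1)), (2, (1, 0)), (3, (0, 1))]
ac = [(1, (1, 1)), (3, (0, 0))]
p = {"a": 0.5, "b": 0.4, "c": 0.2}

abc = join_tid_lists(ab, ac)
support = prob_support(("a", "b", "c"), abc, p)
```

Transaction 1 contains all three items certainly and contributes 1; transaction 3 contains only b certainly and contributes 0.5 * 1 * 0.2.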

6.2 Computing Confidence of Association Rules under Incompleteness

The problem of finding ARpp, ARoo and ARU consists in computing pConf, oConf and
probConf for candidate rules, respectively.

Computing ARpp

Rules ARpp are computed from frequent itemsets F generated by
Generate-P-FrequentItemsets. Let Z ∈ Fk, k > 2, and X ∪ Y = Z. Then X ⇒ Y ∈ ARpp if pConf(X ⇒ Y) >
minConf. In order to compute pConf(X ⇒ Y), we have to know the support of both
n(X) ∩ n(Y) and m(X) ∩ m(¬Y) (see Eq. (1)). Clearly, |n(X) ∩ n(Y)| = |n(X ∪ Y)| = pSup(Z),
which was computed when looking for frequent itemsets. Still, |m(X) ∩ m(¬Y)| (equal to
|m(X) \ n(Y)|) must be computed. Assume the computation of ARpp was preceded by
generating nTIdLists and mTIdLists for each I ∈ F1. These lists consist of identifiers of
transactions containing I certainly and possibly, respectively. The nTIdList and
mTIdList of X and Y can be computed by intersecting the nTIdLists and mTIdLists of all
items in these itemsets, respectively. Then, pConf(X ⇒ Y) can be computed as follows:

X.mTIdList = X[1].mTIdList ∩ X[2].mTIdList ∩ ... ∩ X[|X|].mTIdList;

Y.nTIdList = Y[1].nTIdList ∩ Y[2].nTIdList ∩ ... ∩ Y[|Y|].nTIdList;

pConf = sup(Z) / [sup(Z) + |X.mTIdList \ Y.nTIdList|];

Computing ARoo

Rules ARoo are computed from frequent itemsets F generated by
Generate-O-FrequentItemsets. Let Z ∈ F and X ∪ Y = Z. Then X ⇒ Y ∈ ARoo if oConf(X ⇒ Y) > minConf.
By analogy to pConf, oConf(X ⇒ Y) will be computed as follows (see Eq. (2)):

X.nTIdList = X[1].nTIdList ∩ X[2].nTIdList ∩ ... ∩ X[|X|].nTIdList;

Y.mTIdList = Y[1].mTIdList ∩ Y[2].mTIdList ∩ ... ∩ Y[|Y|].mTIdList;

oConf = sup(Z) / [sup(Z) + |X.nTIdList \ Y.mTIdList|];
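
The two confidence computations can be sketched in Python as follows. This is only an illustration under stated assumptions: supports are taken as absolute transaction counts, the nTIdLists and mTIdLists of single items are given as dictionaries, and all function names and the toy data are hypothetical.

```python
def intersect_all(tid_lists):
    """Intersect a non-empty sequence of transaction-identifier lists."""
    result = set(tid_lists[0])
    for tids in tid_lists[1:]:
        result &= set(tids)
    return result

def p_conf(sup_z, x_items, y_items, n_tids, m_tids):
    """pConf: sup(Z) over sup(Z) plus |X.mTIdList minus Y.nTIdList|."""
    x_m = intersect_all([m_tids[i] for i in x_items])
    y_n = intersect_all([n_tids[i] for i in y_items])
    return sup_z / (sup_z + len(x_m - y_n))

def o_conf(sup_z, x_items, y_items, n_tids, m_tids):
    """oConf: sup(Z) over sup(Z) plus |X.nTIdList minus Y.mTIdList|."""
    x_n = intersect_all([n_tids[i] for i in x_items])
    y_m = intersect_all([m_tids[i] for i in y_items])
    return sup_z / (sup_z + len(x_n - y_m))

# Hypothetical lists: transactions containing each item certainly (n) / possibly (m).
n_tids = {"a": [1, 2], "b": [1, 3]}
m_tids = {"a": [1, 2, 4], "b": [1, 2, 3]}

# Supports of Z = {a, b} as counts: pSup from certain, oSup from possible containment.
p_sup_z = len(intersect_all([n_tids["a"], n_tids["b"]]))
o_sup_z = len(intersect_all([m_tids["a"], m_tids["b"]]))

pc = p_conf(p_sup_z, ["a"], ["b"], n_tids, m_tids)
oc = o_conf(o_sup_z, ["a"], ["b"], n_tids, m_tids)
```

For the rule a ⇒ b, the pessimistic value counts transactions 2 and 4 as possible counterexamples, while the optimistic value finds none.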


Computing ARU

Rules ARU are computed from frequent itemsets F generated by
Generate-I-FrequentItemsets. Let Z ∈ F and X ∪ Y = Z. Then X ⇒ Y ∈ ARU if probConf(X ⇒ Y) > minConf.
Fortunately, probConf can be computed in the same way as the usual conf of rules.


7  Conclusions 

We investigated the notion of a legitimate definition of support and confidence under
incompleteness. It was shown that, for incomplete datasets, legitimate ARap can be
derived from legitimate RRap by the cover operator. We also proved that MMRap
under incompleteness constitute a subset of RRap of the same type. Algorithms for
generating ARap under incompleteness were offered.

A Simple and Tractable Extension of Situation
Calculus to Epistemic Logic

1 Introduction

The frame problem and the representation of knowledge change have received a
great deal of attention. In particular, at the Cognitive Robotics Group in Toronto, several
researchers have, over the last ten years, produced a series of papers in a uniform
logical framework based on Situation Calculus [Rei91, SL93, LR94, LL98]. In
[Rei91] Reiter proposed a simple solution to the frame problem. Scherl and
Levesque in [SL93] defined an extension to Epistemic Logic to represent
knowledge dynamics in contexts where some actions may produce knowledge,
such as sensing actions for a robot. This approach was extended
by Lakemeyer and Levesque in [LL98] to modal operators of the kind “I know
and only know”. They also gave a formal semantics and an axiomatics, and
proved soundness and completeness of the axiomatics.

These extensions to Epistemic Logic offer great expressive power. Indeed,
there is no restriction on formulas in the scope of modal operators. However, they
have lost the simplicity of the solution to the frame problem initially proposed in
[Rei91], and the possibility of finding a tractable implementation of these extensions
is far from obvious. As far as we know, no such implementation exists at
present.

In this paper we present a simple extension of Reiter’s initial solution to Epistemic
Logic that could easily be implemented. In exchange we have to accept
strong restrictions on the expressive power of the epistemic part of the logical
framework. However, we believe that for a large class of applications these
restrictions are not real limitations. In the following, the intuitive ideas of the proposed
solution are presented with a simple example. Then we give the general logical
framework, and finally a comparison is made with the solutions
mentioned above.

2  The  frame  problem  in  the  context  of  extended  situation 
calculus:  an  example 

Situation Calculus [McC68, Rei99] is a sort of classical first order logic where
predicates may have an argument (the last argument) of a particular sort, which
is called a “situation”; these predicates are called “fluents”. This argument is
intended to represent the sequence of actions which have been performed from
the initial state to the current state. A situation is syntactically represented by
a term of the form $\mathit{do}(a, s)$ where $a$ denotes an action, and $s$ denotes a situation.
The initial situation is denoted by $S_0$.

For instance, $\mathit{position}(x, s)$ represents the fact that a given object is at
position $x$ in situation $s$. Action variables and situation variables can
be quantified. For instance, $\neg\exists s\,\mathit{position}(2, s)$ represents the fact that in no
situation is a given object at position 2. Action quantification is an essential
feature in the solution to the frame problem proposed by Reiter. Indeed,
the fact that, for example, the only way to change the
position of an object is to perform the action $\mathit{move}$ can be represented
by: $\forall s\,\forall a\,\forall x\,(\mathit{position}(x, s) \land \neg\mathit{position}(x, \mathit{do}(a, s)) \rightarrow a = \mathit{move})$. To
present intuitively how the solution to the frame problem can be extended to epistemic
logic, we use the following scenario.

Let’s consider a simple robot that can move forward (action adv) or backward
(action rev) along a railtrack. Performing action adv or rev changes
his position by one distance unit. There may be obstacles on the railtrack, such as
fallen tree branches. Suppose the robot is moving during the night
and there is a pilot in the robot. The pilot can recognise obstacles, provided he
has switched on a spotlight (action obs.obstacle) and the obstacle is not beyond
the visibility distance d. The spotlight is not always on because it consumes
battery resources, which are limited. When the robot moves he computes his
new position, and this position is indicated on a screen which can be seen by
the pilot (action inf.position(x)). The pilot performs the action inf.position(x)
before the action obs.obstacle in order to know his position and to determine the
position of visible obstacles, if there are any. The pilot can inform the robot about
the existence of an obstacle at x (action inf.obstacle(x)), and the robot stops if
he knows that there is an obstacle within a short distance sd.

We see that the description of this scenario involves evolution of the world
and evolution of what the pilot and the robot believe 2. We first show how the
frame problem can be solved if we only consider evolution of the world.

For each fluent, two axioms define the positive effects and the negative effects of
the actions. For instance, for the fluent $\mathit{position}(x, s)$, the effect of performing the
action $\mathit{adv}$ (respectively $\mathit{rev}$) when the robot is at position $x - 1$ (respectively
$x + 1$) in situation $s$ is that it is at position $x$ in situation $\mathit{do}(a, s)$ 3:

(1) $(a = \mathit{adv} \land \mathit{position}(x - 1, s) \lor a = \mathit{rev} \land \mathit{position}(x + 1, s)) \rightarrow \mathit{position}(x, \mathit{do}(a, s))$

The negative effect axiom expresses that if the robot is at position $x$ in
situation $s$ and he performs either the action $\mathit{adv}$ or the action $\mathit{rev}$, then in
situation $\mathit{do}(a, s)$ he is no longer at position $x$:

2  We  have  no  room  here  to  give  a  complete  formal  description  of  this  scenario.  Also, 
some  assumptions  are  not  perfectly  realistic,  but  we  mainly  want  to  show  how  such 
scenarios  can  be  formalised. 

3 All the variables are implicitly universally quantified.




(2) $(a = \mathit{adv} \lor a = \mathit{rev}) \land \mathit{position}(x, s) \rightarrow \neg\mathit{position}(x, \mathit{do}(a, s))$

One of the most important features for solving the frame problem in the
approach presented in [Rei99] is the “causal completeness assumption”. This
assumption expresses that the positive effect axioms and the negative effect axioms
“characterize all the conditions under which action $a$ can cause the fluent $\mathit{position}$
to become true (respectively false) in the successor situation”. If, in addition to
(1) and (2), we accept this assumption, then we have (see axiom (G2) for the
general form):

(3) $\mathit{position}(x, \mathit{do}(a, s)) \leftrightarrow [a = \mathit{adv} \land \mathit{position}(x - 1, s) \lor a = \mathit{rev} \land \mathit{position}(x + 1, s)] \lor \mathit{position}(x, s) \land \neg[(a = \mathit{adv} \lor a = \mathit{rev}) \land \mathit{position}(x, s)]$
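
Axiom (3) can be read as an update rule for the fluent position. The following Python sketch is only an illustration under assumed encodings: actions are strings, a situation is summarised by the set of positions x for which position(x, s) holds, and the finite domain and helper names are hypothetical.

```python
def position_after(x, action, position_now):
    """Truth value of position(x, do(action, s)) according to axiom (3).

    position_now(y) returns the truth value of position(y, s).
    """
    gamma_plus = ((action == "adv" and position_now(x - 1))
                  or (action == "rev" and position_now(x + 1)))
    gamma_minus = action in ("adv", "rev") and position_now(x)
    return gamma_plus or (position_now(x) and not gamma_minus)

def progress(positions, action, domain):
    """Positions that hold in do(action, s), given those that hold in s."""
    return {x for x in domain
            if position_after(x, action, lambda y: y in positions)}

domain = range(0, 10)                          # hypothetical finite track
s0 = {2}                                       # the robot starts at position 2
s1 = progress(s0, "adv", domain)               # one step forward
s2 = progress(s1, "rev", domain)               # one step backward
s3 = progress(s2, "inf.obstacle(3)", domain)   # an action that does not move the robot
```

An action other than adv or rev makes both bracketed conditions of axiom (3) false, so the position simply persists.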

This axiom defines the objective representation of the evolution of the world.
If we want to define the subjective representation of the evolution of the world,
we can extend the language with epistemic modal operators. For that purpose,
we introduce modal operators like $B_r$, such that $B_r\phi$ is intended to mean that the
robot $r$ believes that $\phi$ holds in the present situation. To represent, in a similar
approach, the evolution of what the robot believes, we have to consider four
effect axioms for each fluent. For example, for the fluent $\mathit{position}(x, s)$, there are
four distinct possible attitudes of the robot which are formally represented by:
$B_r\,\mathit{position}(x, s)$, $\neg B_r\,\mathit{position}(x, s)$, $B_r\neg\mathit{position}(x, s)$ and $\neg B_r\neg\mathit{position}(x, s)$.
The corresponding axioms (4), (5), (6) and (7) are given below.

The effect of performing action $\mathit{adv}$ (respectively $\mathit{rev}$) when the robot believes
that he is at position $x - 1$ (respectively $x + 1$) in situation $s$ is that he
believes that he is at position $x$ in situation $\mathit{do}(a, s)$:

(4) $(a = \mathit{adv} \land B_r\,\mathit{position}(x - 1, s) \lor a = \mathit{rev} \land B_r\,\mathit{position}(x + 1, s)) \rightarrow B_r\,\mathit{position}(x, \mathit{do}(a, s))$

The effect of performing either action $\mathit{adv}$ or $\mathit{rev}$ when the robot believes
that he is at position $x$ in situation $s$ is that he does not believe that he
is at position $x$ in situation $\mathit{do}(a, s)$:

(5) $(a = \mathit{adv} \lor a = \mathit{rev}) \land B_r\,\mathit{position}(x, s) \rightarrow \neg B_r\,\mathit{position}(x, \mathit{do}(a, s))$

We have two similar axioms to define the attitude of the robot with respect
to the fact that he believes that he is not at position $x$ in situation
$\mathit{do}(a, s)$:

(6) $(a = \mathit{adv} \lor a = \mathit{rev}) \land B_r\,\mathit{position}(x, s) \rightarrow B_r\neg\mathit{position}(x, \mathit{do}(a, s))$

(7) $(a = \mathit{adv} \land B_r\,\mathit{position}(x - 1, s) \lor a = \mathit{rev} \land B_r\,\mathit{position}(x + 1, s)) \rightarrow \neg B_r\neg\mathit{position}(x, \mathit{do}(a, s))$

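If we extend the causal completeness assumptions to the robot’s beliefs, we
get, after some simplifications, the two axioms (8) and (9) (see axioms (G3) and
(G4) for the general form):

(8) $B_r\,\mathit{position}(x, \mathit{do}(a, s)) \leftrightarrow [a = \mathit{adv} \land B_r\,\mathit{position}(x - 1, s) \lor a = \mathit{rev} \land B_r\,\mathit{position}(x + 1, s)] \lor B_r\,\mathit{position}(x, s) \land \neg[(a = \mathit{adv} \lor a = \mathit{rev}) \land B_r\,\mathit{position}(x, s)]$

(9) $B_r\neg\mathit{position}(x, \mathit{do}(a, s)) \leftrightarrow [(a = \mathit{adv} \lor a = \mathit{rev}) \land B_r\,\mathit{position}(x, s)] \lor B_r\neg\mathit{position}(x, s) \land \neg[a = \mathit{adv} \land B_r\,\mathit{position}(x - 1, s) \lor a = \mathit{rev} \land B_r\,\mathit{position}(x + 1, s)]$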

Notice that in the definition of these axioms we have implicitly assumed that
if the robot performs either the action adv or the action rev, he believes that he
has performed these actions. However, if some action is performed by the pilot,
like the action obs.obstacle, the robot is not necessarily informed about this fact.
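
Axioms (8) and (9) can be sketched as an update of two sets of positions: those the robot believes he is at and those he believes he is not at. The Python code below is a minimal illustration under assumed encodings (actions as strings, a finite hypothetical domain); the function name is not from the paper.

```python
def update_position_beliefs(action, bel_pos, bel_not_pos, domain):
    """Apply axioms (8) and (9) once.

    bel_pos:     positions x with B_r position(x, s)
    bel_not_pos: positions x with B_r not position(x, s)
    """
    moved = action in ("adv", "rev")

    def arrives(x):   # first bracket of (8), negated in (9)
        return ((action == "adv" and (x - 1) in bel_pos)
                or (action == "rev" and (x + 1) in bel_pos))

    def leaves(x):    # first bracket of (9), negated in (8)
        return moved and x in bel_pos

    new_pos = {x for x in domain
               if arrives(x) or (x in bel_pos and not leaves(x))}
    new_not = {x for x in domain
               if leaves(x) or (x in bel_not_pos and not arrives(x))}
    return new_pos, new_not

domain = range(0, 5)                                   # hypothetical finite track
b0 = ({2}, {0, 1, 3, 4})                               # the robot believes he is at 2
b1 = update_position_beliefs("adv", b0[0], b0[1], domain)
b2 = update_position_beliefs("obs.obstacle", b1[0], b1[1], domain)
```

The pilot's action obs.obstacle leaves both sets unchanged, and the two sets stay disjoint, so the robot's beliefs about his position remain consistent in this example.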

It is interesting to see with this example how the pilot’s beliefs and the robot’s
beliefs about the fluent $\mathit{obstacle}$ may evolve in two different ways. We have the
following effect axioms (10), (11), (12) and (13) for this fluent.

If the pilot has switched on the spotlight, and there is an obstacle at some
position $x$ which is visible to the pilot, then the pilot believes that there is an
obstacle at $x$ 4:

(10) $a = \mathit{obs.obstacle} \land \mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d \rightarrow B_p\,\mathit{obstacle}(x, \mathit{do}(a, s))$

If the pilot has switched on the spotlight, and there is no obstacle at some
position $x$ which is visible to the pilot, then the pilot does not believe that there
is an obstacle at $x$:

(11) $a = \mathit{obs.obstacle} \land \neg\mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d \rightarrow \neg B_p\,\mathit{obstacle}(x, \mathit{do}(a, s))$

We have two similar effect axioms for $\neg\mathit{obstacle}(x, \mathit{do}(a, s))$.

(12) $a = \mathit{obs.obstacle} \land \neg\mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d \rightarrow B_p\neg\mathit{obstacle}(x, \mathit{do}(a, s))$

(13) $a = \mathit{obs.obstacle} \land \mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d \rightarrow \neg B_p\neg\mathit{obstacle}(x, \mathit{do}(a, s))$

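Then, from the causal completeness assumption we have the axioms (14) and
(15).

(14) $B_p\,\mathit{obstacle}(x, \mathit{do}(a, s)) \leftrightarrow [a = \mathit{obs.obstacle} \land \mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d] \lor B_p\,\mathit{obstacle}(x, s) \land \neg[a = \mathit{obs.obstacle} \land \neg\mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d]$

(15) $B_p\neg\mathit{obstacle}(x, \mathit{do}(a, s)) \leftrightarrow [a = \mathit{obs.obstacle} \land \neg\mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d] \lor B_p\neg\mathit{obstacle}(x, s) \land \neg[a = \mathit{obs.obstacle} \land \mathit{obstacle}(x, s) \land \mathit{position}(y, s) \land y < x < y + d]$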

If the only way for the robot to be informed about the fact that there is an
obstacle at $x$ is to perform the action $\mathit{inf.obstacle}(x)$, then we have the axioms
(16) and (17) below.

(16) $B_r\,\mathit{obstacle}(x, \mathit{do}(a, s)) \leftrightarrow a = \mathit{inf.obstacle}(x) \lor B_r\,\mathit{obstacle}(x, s)$

(17) $B_r\neg\mathit{obstacle}(x, \mathit{do}(a, s)) \leftrightarrow B_r\neg\mathit{obstacle}(x, s) \land \neg(a = \mathit{inf.obstacle}(x))$

Let’s assume that in the initial situation $S_0$ neither the pilot nor the robot
knows whether there are obstacles anywhere. This is formally represented
by: $\neg\exists x\,B_r\,\mathit{obstacle}(x, S_0)$, $\neg\exists x\,B_r\neg\mathit{obstacle}(x, S_0)$, $\neg\exists x\,B_p\,\mathit{obstacle}(x, S_0)$ and
$\neg\exists x\,B_p\neg\mathit{obstacle}(x, S_0)$. If in the situation $S_0$ there is an obstacle at position
3, the pilot and the robot have wrong beliefs. If the distance $d$ is equal to 10, after
performance of the action $a_1 = \mathit{obs.obstacle}$, the pilot believes that there is an
obstacle at position 3, while the robot does not know that there is an obstacle at
position 3, i.e. we have: $B_p\,\mathit{obstacle}(3, \mathit{do}(a_1, S_0))$ and $\neg B_r\,\mathit{obstacle}(3, \mathit{do}(a_1, S_0))$.
Finally, if after action $a_1$ the pilot performs the action $a_2 = \mathit{inf.obstacle}(3)$,
the robot and the pilot have the same beliefs about this obstacle. We have:
$B_p\,\mathit{obstacle}(3, \mathit{do}(a_2, \mathit{do}(a_1, S_0)))$ and $B_r\,\mathit{obstacle}(3, \mathit{do}(a_2, \mathit{do}(a_1, S_0)))$.

4 As a matter of simplification, it is assumed here that the pilot only looks at obstacles
that are ahead.
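
The evolution of the pilot's and the robot's beliefs in this example can be replayed with a small Python sketch of axioms (14) to (17). The encoding is an assumption made for illustration: beliefs are sets of positions, inf.obstacle(x) is the tuple ("inf.obstacle", x), the domain is finite, and the robot is taken to be at position 0, which the text does not state.

```python
def visible(x, y, d):
    """From position y the pilot sees position x when y < x < y + d."""
    return y < x < y + d

def step(action, obstacles, y, beliefs, d, domain):
    """One application of axioms (14)-(17).

    beliefs = (bp, bp_not, br, br_not): positions x for which the pilot (p) or
    the robot (r) believes obstacle(x), respectively believes not obstacle(x).
    """
    bp, bp_not, br, br_not = beliefs
    looking = action == "obs.obstacle"
    is_inf = isinstance(action, tuple) and action[0] == "inf.obstacle"
    told = action[1] if is_inf else None

    def sees_obstacle(x):
        return looking and x in obstacles and visible(x, y, d)

    def sees_free(x):
        return looking and x not in obstacles and visible(x, y, d)

    new_bp = {x for x in domain
              if sees_obstacle(x) or (x in bp and not sees_free(x))}          # (14)
    new_bp_not = {x for x in domain
                  if sees_free(x) or (x in bp_not and not sees_obstacle(x))}  # (15)
    new_br = {x for x in domain if x == told or x in br}                      # (16)
    new_br_not = {x for x in domain if x in br_not and x != told}             # (17)
    return new_bp, new_bp_not, new_br, new_br_not

domain = range(0, 20)                 # hypothetical finite track
obstacles, y, d = {3}, 0, 10          # robot position 0 is an assumption
b0 = (set(), set(), set(), set())     # nobody believes anything about obstacles
b1 = step("obs.obstacle", obstacles, y, b0, d, domain)
b2 = step(("inf.obstacle", 3), obstacles, y, b1, d, domain)
```

After obs.obstacle only the pilot believes there is an obstacle at 3; after inf.obstacle(3) the robot believes it as well.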

In fact these actions can be performed only if some preconditions are satisfied.
These preconditions are expressed with a particular predicate $\mathit{Poss}$ (see [Rei99]).
The formula $\mathit{Poss}(a, s)$ means that in situation $s$ it is possible to perform
the action $a$. For example, a precondition to perform the action $\mathit{adv}$ is that the
robot does not believe that there is an obstacle within a short distance $sd$, and there
is no obstacle in front of him.

(18) $\mathit{Poss}(\mathit{adv}, s) \leftrightarrow \neg\exists x\,\exists y\,(B_r\,\mathit{position}(x, s) \land B_r\,\mathit{obstacle}(y, s) \land y - x < sd) \land \neg(\mathit{position}(x, s) \land \mathit{obstacle}(y, s) \land y = x + 1)$


3  General  framework 

Now we present the general framework of the extended Situation Calculus. Let $L$
be a first order language with equality with the constant symbol $S_0$, the function
symbol $\mathit{do}$, and the predicate symbol $\mathit{Poss}$. Let $L_M$ be an extension of language
$L$ with modal operators denoted by $B_1, \ldots, B_i, \ldots$, where modal operators can
only occur in modal literals. Modal literals are of the form $B_i l$, where $l$ is a literal
of $L$, and $l$ is not formed with the equality predicate. Let’s consider the theory
$T$ which contains the following axioms.

Action  precondition  axioms. 

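For each action $a$ there is in $T$ an axiom of the form 5:

(G1) $\mathit{Poss}(a, s) \leftrightarrow \pi_a(s)$
where $\pi_a$ is a formula in $L_M$.

Successor state axioms.

For each fluent $F$ there is in $T$ an axiom of the form:

(G2) $F(\mathit{do}(a, s)) \leftrightarrow \Gamma^{+}_{F}(a, s) \lor F(s) \land \neg\Gamma^{-}_{F}(a, s)$
where $\Gamma^{+}_{F}$ and $\Gamma^{-}_{F}$ are formulas in $L$.

Successor belief state axioms.

For each modal operator $B_i$ and each fluent $F$ 6, there are in $T$ two axioms
of the form:

(G3) $B_i(F(\mathit{do}(a, s))) \leftrightarrow \Gamma^{+}_{i_1,F}(a, s) \lor B_i(F(s)) \land \neg\Gamma^{-}_{i_1,F}(a, s)$

(G4) $B_i(\neg F(\mathit{do}(a, s))) \leftrightarrow \Gamma^{+}_{i_2,F}(a, s) \lor B_i(\neg F(s)) \land \neg\Gamma^{-}_{i_2,F}(a, s)$
where $\Gamma^{+}_{i_1,F}$, $\Gamma^{-}_{i_1,F}$, $\Gamma^{+}_{i_2,F}$ and $\Gamma^{-}_{i_2,F}$ are formulas in $L_M$.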

We  also  have  in  T  unique  name  axioms  for  actions  and  for  situations,  and 
we  assume  that  modal  operators  obey  the  (KD)  logic  (see  [Che88]). 

5 As a matter of simplification the arguments of function symbols are not made explicit,
and, for fluents, the only argument which is made explicit is the situation. For instance,
we could have $a(x_1)$ and $F(x_1, x_2, s)$. Also, it is assumed that all the free variables
are universally quantified.

6 To avoid having equality in the scope of modal operators, we assume that fluent
functions are expressed via fluent predicates, i.e. $y = f(x, s)$ is expressed by $F(y, x, s)$.
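
Moreover, it is assumed that for each fluent $F$ we have 7:

(H1) $\vdash T \rightarrow \forall\neg(\Gamma^{+}_{F} \land \Gamma^{-}_{F})$

(H2) $\vdash T \rightarrow \forall\neg(\Gamma^{+}_{i_1,F} \land \Gamma^{-}_{i_1,F})$

(H3) $\vdash T \rightarrow \forall\neg(\Gamma^{+}_{i_2,F} \land \Gamma^{-}_{i_2,F})$

(H4) $\vdash T \rightarrow \forall\neg(\Gamma^{+}_{i_1,F} \land \Gamma^{+}_{i_2,F})$

(H5) $\vdash T \rightarrow \forall(B_i(F(s)) \land \Gamma^{+}_{i_2,F} \rightarrow \Gamma^{-}_{i_1,F})$

(H6)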

The three assumptions (H4), (H5) and (H6) guarantee that if agents’ beliefs
are consistent in the initial state, they are consistent in all the successor states.

It can easily be shown that, if we have (H1), in the context of the theory $T$,
successor state axioms like (G2) are equivalent to the conjunction of properties
(A1), (A2), (A3), and (A4).

(A1) $\Gamma^{+}_{F}(a, s) \rightarrow F(\mathit{do}(a, s))$

(A2) $\Gamma^{-}_{F}(a, s) \rightarrow \neg F(\mathit{do}(a, s))$

(A3) $\neg\Gamma^{-}_{F}(a, s) \rightarrow [F(s) \rightarrow F(\mathit{do}(a, s))]$

(A4) $\neg\Gamma^{+}_{F}(a, s) \rightarrow [\neg F(s) \rightarrow \neg F(\mathit{do}(a, s))]$
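
For a fixed action, situation and tuple of arguments, the formulas involved reduce to truth values, so the stated equivalence can be checked exhaustively. The Python sketch below does this at the propositional level only; it is an illustration, not the proof, and the function names are hypothetical.

```python
from itertools import product

def successor(f_now, gamma_plus, gamma_minus):
    """Right-hand side of (G2): Gamma+ or (F(s) and not Gamma-)."""
    return gamma_plus or (f_now and not gamma_minus)

def implies(a, b):
    return (not a) or b

def satisfies_a1_a4(f_now, gp, gm, f_next):
    return (implies(gp, f_next)                                  # (A1)
            and implies(gm, not f_next)                          # (A2)
            and implies(not gm, implies(f_now, f_next))          # (A3)
            and implies(not gp, implies(not f_now, not f_next))) # (A4)

# Under (H1), Gamma+ and Gamma- are never both true.
cases = [(f, gp, gm, fn)
         for f, gp, gm, fn in product([False, True], repeat=4)
         if not (gp and gm)]

# (A1)-(A4) hold exactly when F(do(a, s)) has the value prescribed by (G2).
equivalent = all(satisfies_a1_a4(f, gp, gm, fn) == (fn == successor(f, gp, gm))
                 for f, gp, gm, fn in cases)
```

Dropping the filter on gp and gm makes the check fail, which shows where (H1) is used: when both conditions hold, (A1) and (A2) contradict each other.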

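In a similar way we have shown that, if we have (H2) and (H3), in the context
of the theory $T$, successor belief state axioms of the form (G3) (resp. (G4)) are
equivalent to the conjunction of properties (B1), (B2), (B3) and (B4) (resp. (C1),
(C2), (C3) and (C4)).

(B1) $\Gamma^{+}_{i_1,F}(a, s) \rightarrow B_i(F(\mathit{do}(a, s)))$

(B2) $\Gamma^{-}_{i_1,F}(a, s) \rightarrow \neg B_i(F(\mathit{do}(a, s)))$

(B3) $\neg\Gamma^{-}_{i_1,F}(a, s) \rightarrow [B_i(F(s)) \rightarrow B_i(F(\mathit{do}(a, s)))]$

(B4) $\neg\Gamma^{+}_{i_1,F}(a, s) \rightarrow [\neg B_i(F(s)) \rightarrow \neg B_i(F(\mathit{do}(a, s)))]$

(C1) $\Gamma^{+}_{i_2,F}(a, s) \rightarrow B_i(\neg F(\mathit{do}(a, s)))$

(C2) $\Gamma^{-}_{i_2,F}(a, s) \rightarrow \neg B_i(\neg F(\mathit{do}(a, s)))$

(C3) $\neg\Gamma^{-}_{i_2,F}(a, s) \rightarrow [B_i(\neg F(s)) \rightarrow B_i(\neg F(\mathit{do}(a, s)))]$

(C4) $\neg\Gamma^{+}_{i_2,F}(a, s) \rightarrow [\neg B_i(\neg F(s)) \rightarrow \neg B_i(\neg F(\mathit{do}(a, s)))]$

Properties (B3) and (C3) show that positive beliefs remain unchanged after
performance of an action as long as $\neg\Gamma^{-}_{i_1,F}(a, s)$ and $\neg\Gamma^{-}_{i_2,F}(a, s)$ hold.
Properties (B4) and (C4) show that negative beliefs remain unchanged after
performance of an action as long as $\neg\Gamma^{+}_{i_1,F}(a, s)$ and $\neg\Gamma^{+}_{i_2,F}(a, s)$ hold.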

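Definition 1. Regression operator. We define a regression operator $R_T$ from
formulas in $L_M$ to formulas in $L_M$.

1. When $W$ is a non-fluent atom, including equality atoms, or when $W$ is a
fluent atom whose situation argument is the constant $S_0$, $R_T[W] = W$.

2. When $W$ is an atom formed with fluent $F$ of the form $F(\vec{t}, \mathit{do}(\alpha, \sigma))$ whose
successor state axiom in $T$ is 8 $\forall a\,\forall s\,\forall\vec{x}\,[F(\vec{x}, \mathit{do}(a, s)) \leftrightarrow \Phi_F]$ then:

$R_T[F(\vec{t}, \mathit{do}(\alpha, \sigma))] = R_T[\Phi_F.\{\vec{x}/\vec{t}, a/\alpha, s/\sigma\}]$

3. When $W$ is an atom of the form $\mathit{Poss}(\alpha(\vec{t}), \sigma)$, whose action precondition
axiom is $\forall s\,\forall\vec{x}\,\mathit{Poss}(\alpha(\vec{x}), s) \leftrightarrow \Pi_\alpha(\vec{x}, s)$ then:

$R_T[\mathit{Poss}(\alpha(\vec{t}), \sigma)] = R_T[\Pi_\alpha(\vec{x}, s).\{\vec{x}/\vec{t}, s/\sigma\}]$

4. When $W$ is a modal literal of the form $B_i(F(\vec{t}, \mathit{do}(\alpha, \sigma)))$ or
$B_i(\neg F(\vec{t}, \mathit{do}(\alpha, \sigma)))$ whose successor belief state axioms in $T$ are
$\forall a\,\forall s\,\forall\vec{x}\,[B_i(F(\vec{x}, \mathit{do}(a, s))) \leftrightarrow \Phi_{i_1,F}]$
and $\forall a\,\forall s\,\forall\vec{x}\,[B_i(\neg F(\vec{x}, \mathit{do}(a, s))) \leftrightarrow \Phi_{i_2,F}]$ then:

$R_T[B_i(F(\vec{t}, \mathit{do}(\alpha, \sigma)))] = R_T[\Phi_{i_1,F}.\{\vec{x}/\vec{t}, a/\alpha, s/\sigma\}]$ and
$R_T[B_i(\neg F(\vec{t}, \mathit{do}(\alpha, \sigma)))] = R_T[\Phi_{i_2,F}.\{\vec{x}/\vec{t}, a/\alpha, s/\sigma\}]$

5. When $W$ is a formula in $L_M$ 9, $R_T[\neg W] = \neg R_T[W]$ and $R_T[\exists x\,W] = \exists x\,R_T[W]$.

6. When $W_1$ and $W_2$ are formulas in $L_M$, $R_T[W_1 \lor W_2] = R_T[W_1] \lor R_T[W_2]$.

7 Here, we use the symbol $\forall$ to denote the universal closure of all the free variables in
the scope of $\forall$.

8 We use the notation $\vec{x}$ for the tuple of variables $x_1, \ldots, x_n$, and $\forall\vec{x}$ for $\forall x_1 \ldots \forall x_n$;
$\Phi_F.\{\vec{x}/\vec{t}, a/\alpha, s/\sigma\}$ denotes the result of the application of the substitution
$\{\vec{x}/\vec{t}, a/\alpha, s/\sigma\}$ to formula $\Phi_F$.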

Theorem 2. Let $T_0$ be a set of closed sentences in $L_M$, without the $\mathit{Poss}$ predicate,
and whose situation argument in fluents is $S_0$. Let $T_{ss}$ be the set of precondition
axioms and of successor axioms for the fluents of a given application. Let $T_u$
be the set of unique name axioms. We use the notations $T = T_u \cup T_{ss} \cup T_0$ and
$T' = T_u \cup T_0$. Let $R^{*}_{T}(\varphi)$ be the result of repeated applications of $R_T$ until the
result is unchanged. Let $s_{gr}$ be a ground situation term.

We have $\vdash T \rightarrow W(s_{gr})$ iff $\vdash T' \rightarrow R^{*}_{T}[W(s_{gr})]$.

For the proof we can use the same technique as Scherl and Levesque in [SL93],
but the proof is much simpler because we do not have an explicit accessibility
relation to represent modal operators (see next section).

This theorem shows that proving $W$ in situation $s_{gr}$ amounts to proving $R^{*}_{T}[W]$
in situation $S_0$, dropping axioms of the kind (G1), (G2), (G3) and (G4).

Theorem 2 can be used for different purposes. The most important of them,
as mentioned by Reiter in [Rei99], is to check whether a given sequence of
actions is executable, in the sense that after performing any of these actions, the
preconditions to perform the next action are satisfied. Another is to check
whether some property holds after performance of a given sequence of actions.
These two features are essential for plan generation.
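
Both uses can be illustrated by a Python sketch that evaluates a query about do(a, s) by recursing through the successor state axiom down to S0. This is not the symbolic operator of Definition 1, only regression-style evaluation against a fully known initial situation. The initial position, the obstacle set and the simplified precondition for adv (only the objective part of (18)) are assumptions made for the example.

```python
S0 = "S0"

def do(action, situation):
    """Build the situation term do(action, situation)."""
    return ("do", action, situation)

INITIAL_POSITION = 2    # hypothetical initial database: position(2, S0)
OBSTACLES = {5}         # hypothetical obstacles, unchanged by actions

def position(x, situation):
    """Evaluate position(x, situation) by regressing through axiom (3) to S0."""
    if situation == S0:
        return x == INITIAL_POSITION
    _, a, s = situation
    gamma_plus = ((a == "adv" and position(x - 1, s))
                  or (a == "rev" and position(x + 1, s)))
    gamma_minus = a in ("adv", "rev") and position(x, s)
    return gamma_plus or (position(x, s) and not gamma_minus)

def poss(action, situation, domain=range(0, 10)):
    """Simplified precondition: adv is possible unless an obstacle is right ahead."""
    if action == "adv":
        return not any(position(x, situation) and (x + 1) in OBSTACLES
                       for x in domain)
    return True

def executable(actions):
    """Check that each action is possible in the situation reached before it."""
    situation = S0
    for a in actions:
        if not poss(a, situation):
            return False
        situation = do(a, situation)
    return True

at_4 = position(4, do("adv", do("adv", S0)))
two_steps = executable(["adv", "adv"])
three_steps = executable(["adv", "adv", "adv"])
```

The third adv would move the robot from position 4 onto the obstacle at 5, so the longer sequence is rejected.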

We  also  have  the  following  theorem. 

Theorem 3. Let $A$ be a formula of the form $F$, $B_i F$ or $B_i\neg F$, where $F$ is
an atom formed with a fluent predicate. Let $T$ be a theory such that for every
successor axiom of the form: $A(\vec{x}, \mathit{do}(a, s)) \leftrightarrow \Gamma^{+}_{A}(\vec{x}, a, s) \lor A(\vec{x}, s) \land \neg\Gamma^{-}_{A}(\vec{x}, a, s)$,
there is no other variable that occurs in $\Gamma^{+}_{A}$ or $\Gamma^{-}_{A}$ than the variables in $\vec{x}$, or $a$
or $s$. Let $\phi(s)$ be a formula in $L_M$ such that the only variable that occurs in $\phi$
is $s$.

If for every ground formula $A(S_0)$ we have either $\vdash T \rightarrow A(S_0)$ or
$\vdash T \rightarrow \neg A(S_0)$, then, for every ground situation term $s_{gr}$, which is a successor situation
of $S_0$, we have either $\vdash T \rightarrow \phi(s_{gr})$ or $\vdash T \rightarrow \neg\phi(s_{gr})$.

9 The definition of $R_T$ for the universal quantifier $\forall$, conjunction $\land$, implication $\rightarrow$ and
equivalence $\leftrightarrow$ is directly obtained from the usual definitions of this quantifier and
these logical connectives in terms of $\exists$, $\neg$ and $\lor$.

The proof is by induction on the complexity of the formula $\phi$ in $S_0$, and by
induction on the depth of the term $s_{gr}$. Theorem 3 intuitively says that if we
have a complete description of what the agents believe in $S_0$, then we have a
complete description of their beliefs in every successor situation.

4  Related  works 

In [SL93] Scherl and Levesque have defined an extension of Situation Calculus to
Epistemic Logic for a single modal operator, but without any restriction on
formulas that are in the scope of the modal operator.

In their approach, the first idea is to define the modal operator $\mathit{Knows}$ in
terms of an accessibility relation $K$ which is explicitly represented in the
axiomatics. Formally, they have: $\mathit{Knows}(\phi, s) \stackrel{def}{=} \forall s'(K(s', s) \rightarrow \phi[s'])$. The
second idea is to define knowledge change by defining accessibility relation change.
Moreover, two kinds of actions are distinguished: knowledge-producing actions,
denoted by $\alpha_1, \ldots, \alpha_n$, and non-knowledge-producing actions. Each action $\alpha_i$
informs the agent in the situation $\mathit{do}(\alpha_i, s)$ about the fact that some formula $p_i$
is true or false in the situation $s$. It is assumed that the action $\alpha_i$ does not change
the state of the world. From a technical point of view, after the performance of
action $\alpha_i$, relation $K$ selects, in the situation $\mathit{do}(\alpha_i, s)$, those situations where $p_i$
has the same truth value as it has in the situation $s$. For instance, if $p_i$ is true
in $s$, then situations $s'$ which are accessible from $s$ and where $p_i$ is false are
no longer accessible from $\mathit{do}(\alpha_i, s)$. Notice that if $p_i$ is false in all the situations
which are accessible from $s$, there is no situation accessible from $\mathit{do}(\alpha_i, s)$. That
means that the agent believes any formula.

This problem disappears in the logical framework presented by Shapiro et al.
in [SPL00], where a plausibility degree $pl(s)$ is assigned to every situation. From
the accessibility relation $B(s', s)$, an accessibility relation $B_{max}(s', s)$ between $s$
and the most plausible situations can be defined by
$B_{max}(s', s) \stackrel{def}{=} B(s', s) \land \forall s''(B(s'', s) \rightarrow pl(s') \leq pl(s''))$.
Then, the fact that an agent believes $\phi$ in $s$
is defined as $\mathit{Bel}(\phi, s) \stackrel{def}{=} \forall s'(B_{max}(s', s) \rightarrow \phi[s'])$. Here, an agent can
consistently believe $\phi$ in $\mathit{do}(a, s)$, while he believed $\neg\phi$ in $s$, provided there exists at
least one most plausible situation related to $\mathit{do}(a, s)$ where $\phi$ holds.
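
The selection of most plausible situations can be sketched in Python as follows. The sketch follows the convention of the formula above, where the selected situations are those with the smallest value of pl; the data structures (a set of accessible situations, a dictionary for pl) and all names are assumptions made for illustration.

```python
def most_plausible(accessible, pl):
    """Situations s' with B(s', s) such that pl(s') <= pl(s'') for every accessible s''."""
    if not accessible:
        return set()
    best = min(pl[s] for s in accessible)
    return {s for s in accessible if pl[s] == best}

def believes(phi, accessible, pl):
    """Bel(phi, s): phi holds in every most plausible accessible situation."""
    return all(phi(s) for s in most_plausible(accessible, pl))

# Hypothetical example: three accessible situations; phi fails only in the
# least plausible one, so it is still believed.
accessible = {"s1", "s2", "s3"}
pl = {"s1": 0, "s2": 0, "s3": 2}
phi_true_in = {"s1", "s2"}

selected = most_plausible(accessible, pl)
bel_phi = believes(lambda s: s in phi_true_in, accessible, pl)
bel_not_phi = believes(lambda s: s not in phi_true_in, accessible, pl)
```

Because belief is evaluated only over the selected situations, a change in plausibility can reverse a belief without emptying the set of accessible situations.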

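For a non-knowledge-producing action $a$, it is assumed that knowledge changes
in the same way as the world does. That is, if a situation $s'$ is accessible from
$s$, the situation $\mathit{do}(a, s')$ is accessible from $\mathit{do}(a, s)$ as well. In formal terms, the
evolution of relation $K$ is defined by the following axiom 10:

$$
\begin{aligned}
&\mathit{Poss}(a, s) \rightarrow \\
&[K(s'', \mathit{do}(a, s)) \leftrightarrow \exists s'(K(s', s) \land s'' = \mathit{do}(a, s') \land \mathit{Poss}(a, s') \land \\
&\quad((\neg(a = \alpha_1) \land \ldots \land \neg(a = \alpha_n)) \lor \\
&\quad a = \alpha_1 \land (p_1(s) \leftrightarrow p_1(s')) \lor \\
&\quad \ldots \lor \\
&\quad a = \alpha_n \land (p_n(s) \leftrightarrow p_n(s'))))]
\end{aligned}
$$

10 In fact, condition $\mathit{Poss}(a, s')$ is not present in [SL93]; it was added in [LL98].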

This successor axiom does not explicitly define which formulas are true or
false in $\mathit{do}(a, s')$. From the examples presented in their paper we understand
that the truth value of formulas in situations like $s''$ is defined by the successor
state axioms of the type (G2). That implicitly means that: i) whenever some
action has been performed the agent knows that this action has been performed,
ii) the agent knows the effects of all the actions, i.e. he knows all the successor
state axioms, and iii) when the agent gets information through a
knowledge-producing action, this information is always true information, in the sense that
this information is true in the situation $s$ where he is.
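
The successor axiom for $K$ discussed above can be sketched in Python as a filter on accessible situations. The toy model is an assumption made for illustration: a situation is a pair (door_open, history), the single knowledge-producing action sense_door senses the formula "the door is open", and every action is always possible.

```python
def successor_accessible(accessible, actual, action, sensed, poss, do):
    """Situations accessible from do(action, actual).

    accessible: situations s' accessible from the actual situation
    sensed:     maps each knowledge-producing action to its formula p_i,
                given as a predicate on situations
    """
    result = set()
    p = sensed.get(action)      # None for non-knowledge-producing actions
    for s1 in accessible:
        if not poss(action, s1):
            continue
        if p is not None and p(s1) != p(actual):
            continue            # s1 disagrees with the sensed truth value
        result.add(do(action, s1))
    return result

def do(action, s):
    """Sensing leaves the world unchanged and extends the history."""
    return (s[0], s[1] + (action,))

def poss(action, s):
    return True

sensed = {"sense_door": lambda s: s[0]}
actual = (True, ())
unsure = {(True, ()), (False, ())}   # the agent does not know the door state
wrong = {(False, ())}                # the agent is wrongly sure the door is closed

after_unsure = successor_accessible(unsure, actual, "sense_door", sensed, poss, do)
after_wrong = successor_accessible(wrong, actual, "sense_door", sensed, poss, do)
```

The second call returns the empty set: no situation remains accessible, which is the case noted earlier where the agent ends up believing any formula.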

How could this formalisation be extended to a multi-agent context? Fact i)
cannot be accepted in general. We can accept that an agent knows that
an action has been performed when it has been performed by himself, but not
necessarily when it has been performed by another agent. This problem could
be solved by defining as many accessibility relations $K_i$ as there are distinct
agents, and by distinguishing for each agent those actions $\beta_1, \ldots, \beta_m$ which are
performed by the agent i. For an action a which is neither of the sort $\beta_j$ nor
$\alpha_k$, the fact that knowledge does not change can be represented by the fact that
the situations accessible from do(a, s) are the same as the situations accessible from s.
That could lead to successor axioms for each relation $K_i$ of the form:

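$$
\begin{aligned}
Poss(a, s) \rightarrow [\, & K_i(s'', do(a, s)) \leftrightarrow {} \\
& (K_i(s'', s) \wedge \neg(a = \alpha_1) \wedge \ldots \wedge \neg(a = \alpha_n) \wedge \neg(a = \beta_1) \wedge \ldots \wedge \neg(a = \beta_m)) \vee {} \\
& (\exists s'\, K_i(s', s) \wedge s'' = do(a, s') \wedge Poss(a, s') \wedge {} \\
& \;\, (a = \beta_1 \vee \ldots \vee a = \beta_m \vee {} \\
& \;\;\, a = \alpha_1 \wedge (p_1(s) \leftrightarrow p_1(s')) \vee {} \\
& \;\;\, \ldots \vee {} \\
& \;\;\, a = \alpha_n \wedge (p_n(s) \leftrightarrow p_n(s'))))\,]
\end{aligned}
$$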

However, even with this extension there are still the problems related to ii)
and iii). For ii), the problem is that in real situations agents may have wrong
beliefs about the evolution of the world. For instance, an agent may believe
that dropping a fragile object breaks it, while another agent may believe
that the object is not necessarily broken, depending on its weight or on other
particular circumstances. This raises the question of how to represent in this
framework different evolutions of the world in the context of different agents'
beliefs. Maybe a possible answer is to have different successor state axioms, for
the same fluent, to represent the "true" evolution of the world, and to represent
the evolution of the world in the context of each agent's beliefs. That is, more or
less, the idea we have proposed in this paper with the axioms of the type (G3)
and (G4).

For iii), the problem is that there are applications where agents may receive
information from different sensors, or from other agents, some of which are not
necessarily reliable and may communicate wrong information. Here again, we
believe that axioms like (G3) and (G4) are a possible solution, because they
allow us to represent communication actions whose consequence is to generate
wrong beliefs in agents.

5 Conclusion

In conclusion, we have presented a general framework to solve the frame problem
in the context of a limited extension of Situation Calculus to Epistemic logic.
Even if for this solution strong restrictions are imposed on the language Lm, we
can express non-trivial properties like $\forall s \forall x (B_r\, position(x, s) \rightarrow position(x, s))$,
which means that in every situation the robot has true beliefs about his position,
or $\forall s \forall x (B_r\, obstacle(x, s) \rightarrow B_p\, obstacle(x, s))$, which means that the robot's
beliefs about obstacles are a subset of the pilot's beliefs about obstacles. Also, since
in the (KD) logics we have $B_i(f \wedge f') \leftrightarrow B_i f \wedge B_i f'$, it would be a trivial extension
to Lm to allow conjunctions of literals in the scope of modal operators. Finally,
the regression operator Rp allows us to check whether these kinds of properties
can be derived from $T_0$. The implementation of a modal theorem prover for the
restricted language Lm should not be a big issue. We are currently working on
this aspect.


Abstract. This paper introduces a procedural approach to perform rule
based abduction in a knowledge base. In this context a knowledge base
is realised as a normal abductive logic program, and an observation can
be either a literal or a rule. An SLDNF resolution based proof procedure
is employed to achieve this rule based abduction. It is shown that using
this algorithm, one can always find a minimal explanation for the observation
if there exists such an explanation.

Key words: Abduction, knowledge representation, nonmonotonic reasoning.


1  Introduction 

Abduction plays a vital role in commonsense reasoning, knowledge representation,
and database update. Basically, if the set of abducibles is constructed not
only with the facts/rules from the knowledge base but also from the beliefs of
other agents regarding the observations, then there is always a chance of making
the knowledge base more consistent and flexible. In nonmonotonic reasoning a
knowledge base may change whenever a new observation is provided. In such
a situation there are three possible effects on the existing knowledge base.

1. The observation is already deducible from the knowledge base, i.e. the observation
can be explained with the help of the existing knowledge.

2. A part of the knowledge base together with the new observation is able to derive the
other part of the knowledge base; if so, the size of the existing knowledge
base can be reduced.

3. Some facts/rules must be added to the existing knowledge base to explain
the observation.

An observation can be a fact, a rule or a program. By finding the explanation of
the observation and adding it to the knowledge base we eventually change
the knowledge base from an old state to a new state.

In this paper we mainly concentrate on situations where observations are
rules. When an observation is a rule, we propose that the body part of the rule
must be explained first to form a new knowledge base, which in turn explains
the head of the rule. Then the union of the explanations gives the complete
explanation for the observation (the rule).

We will provide an SLDNF resolution approach to update the knowledge base
when an observation is either a literal or a rule. The paper is organized as
follows. In the next section we introduce basic definitions and concepts about
abductive logic programs. In section 3 we outline the basic approach for our rule
based abduction through different examples. In section 4, based on the basic
idea presented in section 3, we formalize the procedures of abduction. Finally,
in section 5 we discuss related work and conclude the paper.


2  Definitions  and  Concepts 

We first briefly explain the SLDNF proof procedure that we will use throughout
this paper. Linear resolution with selection function (or SL-resolution) is a
restricted form of linear resolution. The main restriction is effected by a selection
function which chooses from each clause one single literal to be resolved upon
in that clause. SL-resolution operates on chains rather than clauses and hence,
strictly speaking, is not a form of resolution. It does, however, employ the ideas of
unification and resolution. SLD-resolution (SL-resolution for definite clauses) is described
as follows. Let $\mathcal{P}$ be a definite program and $G$ be a goal. An unrestricted SLD
derivation of $\mathcal{P} \cup \{G\}$ consists of a sequence $G_0 = G, G_1, \ldots$ of goals, a sequence
$C_1, C_2, \ldots$ of variants of clauses in program $\mathcal{P}$ (called the input clauses of the
derivation), and a sequence $\theta_1, \theta_2, \ldots$ of substitutions. Each non-empty goal
$G_i$ contains one atom which is the selected atom of $G_i$. The goal $G_{i+1}$ is said to
be derived from $G_i$ and $C_{i+1}$ with substitution $\theta_{i+1}$, and the step is carried out as follows.
Suppose $G_i$ is $\leftarrow A_1, \ldots, A_k, \ldots, A_m$ and $A_k$ is the selected atom. Let $C_{i+1} =
A \leftarrow B_1, \ldots, B_n$ be any clause in $\mathcal{P}$ such that $A$ and $A_k$ are unifiable with a
unifier $\theta$. Then $G_{i+1}$ is $\leftarrow (A_1, \ldots, A_{k-1}, B_1, \ldots, B_n, A_{k+1}, \ldots, A_m)\theta$ and $\theta_{i+1}$
is $\theta$. An unrestricted SLD-refutation is a derivation ending at the empty clause.
SLDNF-resolution is essentially the SLD-resolution scheme augmented by the negation
as failure inference rule. Its soundness and completeness are discussed by Lloyd [4].
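
As an illustration of a single derivation step, the following Python sketch handles the ground (propositional) case only, so every unifier is the identity substitution. The function names `sld_step` and `has_refutation`, the fixed leftmost selection function and the depth bound are assumptions made for this sketch, not part of the procedure described above.

```python
def sld_step(goal, k, clause):
    """Resolve the selected atom goal[k] against a ground clause (head, body)."""
    head, body = clause
    if goal[k] != head:
        raise ValueError("selected atom does not match the clause head")
    # Replace A_k by the clause body B_1, ..., B_n and keep the rest of the goal.
    return goal[:k] + list(body) + goal[k + 1:]


def has_refutation(program, goal, limit=25):
    """Depth-bounded search for a derivation ending in the empty goal."""
    if not goal:
        return True
    if limit == 0:
        return False
    return any(
        has_refutation(program, sld_step(goal, 0, clause), limit - 1)
        for clause in program
        if clause[0] == goal[0]  # selection function: always the leftmost atom
    )


# Hypothetical definite program: p <- q, r.   q <-.   r <-.
program = [("p", ["q", "r"]), ("q", []), ("r", [])]
next_goal = sld_step(["p", "s"], 0, program[0])
```

Here `next_goal` is the goal obtained from `<- p, s` by resolving `p` against the first clause, and `has_refutation(program, ["p"])` searches for a refutation of `<- p`.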

A rule is of the form

$$A_0 \leftarrow A_1, \ldots, A_m, not\,B_{m+1}, \ldots, not\,B_n, \qquad (1)$$

where $A_0, \ldots, A_m, B_{m+1}, \ldots, B_n$ are atoms of language $\mathcal{L}$. A normal logic program
is a finite set of rules of form (1). A rule of form $\leftarrow A_1, \ldots, A_m, not\,A_{m+1},
\ldots, not\,A_n$ is called a normal goal. In a logic program $\mathcal{P}$, a rule without body
$A \leftarrow$ is called a fact. A term, atom, literal, rule or program is ground if no variable
occurs in it. It should also be noted that any free variable in a rule is assumed
to be universally quantified. If $\mathcal{P}$ does not contain any constant symbols, we will
assume one in $\mathcal{P}$.

Now we define a normal abductive logic program to be a pair $\langle \mathcal{P}, \mathcal{A} \rangle$, where
$\mathcal{P}$ is a normal logic program and $\mathcal{A}$ is the set of abducibles. An abducible is a
rule. Let $\langle \mathcal{P}, \mathcal{A} \rangle$ be a normal abductive logic program and Q be a goal (observation).
Note that Q can be a literal or a rule. We first consider the case that Q
is a literal. If Q is a ground atom (we also call Q a positive observation), then
a proper hypothesis is introduced to or removed from the current knowledge base
to explain the observation by showing that Q can be derived from the resulting
knowledge base. If Q is a negative ground atom (we also call Q a negative observation),
then a proper hypothesis is introduced to or removed from the current
knowledge base to explain the observation by showing that the corresponding
ground atom of Q cannot be derived from the resulting knowledge base.

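Definition 1. A pair $(\mathcal{E}, \mathcal{F})$ is an explanation of a positive observation (or
negative observation, resp.) Q with respect to an abductive logic program $\langle \mathcal{P}, \mathcal{A} \rangle$
if

1. $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \vdash Q$ (or $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \nvdash \overline{Q}$ 1, resp.);

2. $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F}$ is consistent; and

3. $\mathcal{E} \subseteq \mathcal{A}$ and $\mathcal{F} \subseteq \mathcal{A} \cap \mathcal{P}$.

An explanation $(\mathcal{E}, \mathcal{F})$ of an observation Q is minimal if for any explanation
$(\mathcal{E}', \mathcal{F}')$ of Q, $\mathcal{E}' \subseteq \mathcal{E}$ and $\mathcal{F}' \subseteq \mathcal{F}$ imply $\mathcal{E}' = \mathcal{E}$ and $\mathcal{F}' = \mathcal{F}$.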


3  The  Approach 

In  this  section  we  describe  the  basic  idea  of  our  approach  of  abductive  reasoning 
by  illustrating  several  examples. 

Example 1. Let $\langle \mathcal{P}, \mathcal{A} \rangle$ be an abductive program such that

$\mathcal{P}$:

    Bird(tweety) ←,
    Bird(opus) ←,
    Broken-Wing(tweety) ←,
    Ab(x) ← Broken-Wing(x),
    Flies(x) ← Bird(x), not Ab(x),

$\mathcal{A}$:

    Broken-Wing(x) ←.

The explanation for a positive observation Q = Flies(tweety) can be derived
from the SLDNF tree shown in Figure 1. That is, the observation Q =
Flies(tweety) has an explanation $(\mathcal{E}, \mathcal{F}) = (\emptyset, \{\text{Broken-Wing(tweety)} \leftarrow\})$.

On the other hand, an explanation $(\mathcal{E}, \mathcal{F})$ for the negative observation Q =
not Flies(opus) is $(\{\text{Broken-Wing(opus)} \leftarrow\}, \emptyset)$ and can be derived from the
SLDNF tree shown in Figure 2. Note that to explain the fact not Flies(opus)
we have to add the fact Broken-Wing(opus) ← to the logic program $\mathcal{P}$.
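
The two explanations of Example 1 can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: the program is grounded by hand over the constants tweety and opus, rules are encoded as hypothetical `(head, positive_body, negative_body)` triples, and `proves` is a naive top-down prover with negation as finite failure that is adequate only for ground, loop-free programs such as this one. It does not build the SLDNF trees of Figures 1 and 2.

```python
def fact(atom):
    """Encode the fact `atom <-` as a rule with empty body."""
    return (atom, (), ())


def ground_program(constants):
    """Ground instance of the program P of Example 1."""
    rules = {fact("Bird(tweety)"), fact("Bird(opus)"), fact("Broken-Wing(tweety)")}
    for c in constants:
        rules.add((f"Ab({c})", (f"Broken-Wing({c})",), ()))
        rules.add((f"Flies({c})", (f"Bird({c})",), (f"Ab({c})",)))
    return rules


def proves(program, atom):
    """Top-down proof with negation as finite failure (ground, loop-free only)."""
    for head, pos, neg in program:
        if head != atom:
            continue
        if all(proves(program, p) for p in pos) and not any(
            proves(program, n) for n in neg
        ):
            return True
    return False


def revise(program, added, removed):
    """Build (P u E) - F for an explanation (E, F)."""
    return (program | added) - removed


P = ground_program(["tweety", "opus"])

# Positive observation Flies(tweety): remove the fact Broken-Wing(tweety) <-.
P_tweety = revise(P, set(), {fact("Broken-Wing(tweety)")})

# Negative observation not Flies(opus): add the fact Broken-Wing(opus) <-.
P_opus = revise(P, {fact("Broken-Wing(opus)")}, set())
```

Before the revision, Flies(tweety) is not derivable and Flies(opus) is; after applying the respective explanation, the situation is reversed in each case.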

Now we consider the case that the observation is a rule. An observation rule
is a rule of the form

$$A_0 \leftarrow A_1, \ldots, A_m, not\,B_{m+1}, \ldots, not\,B_n, \qquad (2)$$

or of the form

$$not\,A_0 \leftarrow A_1, \ldots, A_m, not\,B_{m+1}, \ldots, not\,B_n. \qquad (3)$$

The former is called a positive observation rule, while the latter is called a negative
observation rule.

1 Here we use $\overline{Q}$ to denote the complementary literal of Q. For instance, if Q is not F,
then $\overline{Q}$ is F.

[Figure 1: SLDNF tree. Recoverable node labels: ← Flies(tweety); ← Bird(tweety), not Ab(tweety); ← Ab(tweety); ← Broken-Wing(tweety); annotation "delete Broken-Wing(tweety)"; Failed.]

Fig. 1. The SLDNF tree for $\mathcal{P} \cup \{\leftarrow \text{Flies(tweety)}\}$.

[Figure 2: SLDNF tree. Recoverable node labels: ← Flies(opus); ← Bird(opus), not Ab(opus); ← Ab(opus); ← Broken-Wing(opus); annotation "Add Broken-Wing(opus)"; failed.]

Fig. 2. The SLDNF tree for $\mathcal{P} \cup \{\leftarrow \text{Flies(opus)}\}$.

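Definition 2. Given an abductive program $\langle \mathcal{P}, \mathcal{A} \rangle$, a pair $(\mathcal{E}, \mathcal{F})$ is an explanation
of the observation rule Q if there exist $\mathcal{E}_1, \ldots, \mathcal{E}_n$ and $\mathcal{F}_1, \ldots, \mathcal{F}_n$ such
that

1. $(\mathcal{P} \cup \mathcal{E}_1) - \mathcal{F}_1 \vdash A_1, \ldots, (\mathcal{P} \cup \mathcal{E}_m) - \mathcal{F}_m \vdash A_m$, $(\mathcal{P} \cup \mathcal{E}_{m+1}) - \mathcal{F}_{m+1} \nvdash
B_{m+1}, \ldots, (\mathcal{P} \cup \mathcal{E}_n) - \mathcal{F}_n \nvdash B_n$;

2. $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \vdash A_0$ (or $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \nvdash A_0$ if the head of Q is $not\,A_0$), where
$\mathcal{E}_1 \cup \ldots \cup \mathcal{E}_n \subseteq \mathcal{E}$ and $\mathcal{F}_1 \cup \ldots \cup \mathcal{F}_n \subseteq \mathcal{F}$;

3. $\mathcal{E} \cap \mathcal{F} = \emptyset$ and $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F}$ is consistent;

4. $\mathcal{E} \subseteq \mathcal{A}$ and $\mathcal{F} \subseteq \mathcal{A} \cap \mathcal{P}$.

$(\mathcal{E}, \mathcal{F})$ is a minimal explanation for Q if there does not exist another explanation
$(\mathcal{E}', \mathcal{F}')$ of Q such that $\mathcal{E}' \subseteq \mathcal{E}$ and $\mathcal{F}' \subseteq \mathcal{F}$.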

Now let us see how we can use the SLDNF resolution proof to achieve abductive
reasoning in which an observation is a rule.

Example 2. Consider an abductive program $\langle \mathcal{P}, \mathcal{A} \rangle$ where

$\mathcal{P}$:

    Sci(Kiran) ←,
    Math(Kiran) ←,
    Crazy(x) ← Sci(x), not Cs(x),
    Happy(x) ← Crazy(x), Math(x),
    Crazy(x) ← Phi(x).

$\mathcal{A}$:

    Happy(x) ← Crazy(x), Math(x),
    Phi(Kiran) ←.

Consider the observation Q:

    not Happy(Kiran) ← Crazy(Kiran), not Cs(Kiran).

According to our definition, we need to find $\mathcal{E}_1, \mathcal{E}_2$ and $\mathcal{F}_1, \mathcal{F}_2$ such that $(\mathcal{P} \cup
\mathcal{E}_1) - \mathcal{F}_1 \vdash \text{Crazy(Kiran)}$, $(\mathcal{P} \cup \mathcal{E}_2) - \mathcal{F}_2 \nvdash \text{Cs(Kiran)}$, and $(\mathcal{P} \cup \mathcal{E}_1 \cup \mathcal{E}_2)
- (\mathcal{F}_1 \cup \mathcal{F}_2) \nvdash \text{Happy(Kiran)}$.

Note that any finitely failed SLDNF tree of $\mathcal{P} \cup \{\leftarrow Q\}$ implies that not Q is
derived from $\mathcal{P}$. A detailed discussion can be found in [2]. Now we have the
following revised SLDNF trees for deriving Crazy(Kiran) and not Cs(Kiran)
respectively.

[Figure 3: SLDNF trees. Recoverable root labels: ← Crazy(Kiran) and ← Cs(Kiran).]

Fig. 3. The SLDNF tree for $\mathcal{P} \cup \{\leftarrow \text{Crazy(Kiran)}\}$.

According to our previous discussion, from Figure 3 we can see that a minimal
explanation for Crazy(Kiran) is $(\emptyset, \emptyset)$. Since $\mathcal{P} \nvdash \text{Cs(Kiran)}$, the minimal
explanation for not Cs(Kiran) is $(\emptyset, \emptyset)$.

From Figure 4 we can see that to achieve $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \nvdash \text{Happy(Kiran)}$, there
are four possible explanations:

1. $(\mathcal{E}, \mathcal{F}) = (\emptyset, \{\text{Happy(Kiran)} \leftarrow \text{Crazy(Kiran), Math(Kiran)}\})$,

2. $(\mathcal{E}, \mathcal{F}) = (\emptyset, \{\text{Sci(Kiran)} \leftarrow\})$,

3. $(\mathcal{E}, \mathcal{F}) = (\emptyset, \{\text{Math(Kiran)} \leftarrow\})$,

4. $(\mathcal{E}, \mathcal{F}) = (\{\text{Cs(Kiran)} \leftarrow\}, \emptyset)$.

However, explanations 2 and 3 do not satisfy the condition $\mathcal{F} \subseteq \mathcal{A} \cap \mathcal{P}$ of our definition,
and explanation 4 does not satisfy the condition $\mathcal{E} \subseteq \mathcal{A}$. Also, as the explanation
for both Crazy(Kiran) and not Cs(Kiran) is $(\emptyset, \emptyset)$, we can conclude that
explanation 1 is the only and final minimal explanation for Q.

[Figure 4: SLDNF tree. Recoverable node labels: ← Happy(Kiran); ← Crazy(Kiran), Math(Kiran); ← Sci(Kiran), not Cs(Kiran), Math(Kiran); ← Phi(Kiran), Math(Kiran); annotations "remove Happy(X) ← Crazy(X), Math(X)", "Remove Sci(Kiran) ←", "Add Cs(Kiran) ←", "Remove Math(Kiran) ←"; Failed.]

Fig. 4. The SLDNF tree for $(\mathcal{P} \cup \{\text{Phi(Kiran)} \leftarrow\}) \cup \{\leftarrow \text{Happy(Kiran)}\}$.
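
The conclusion of Example 2 can be cross-checked by brute force. The following Python sketch is not the SLDNF-based procedure of this paper: it simply enumerates every pair (E, F) with E ⊆ A, F ⊆ A ∩ P and E ∩ F = ∅, and, as a simplifying assumption, tests all three derivability conditions against the same revised program (P ∪ E) − F. The rule encoding as `(head, positive_body, negative_body)` triples and all function names are hypothetical.

```python
from itertools import chain, combinations

SCI = ("Sci(Kiran)", (), ())
MATH = ("Math(Kiran)", (), ())
CRAZY_SCI = ("Crazy(Kiran)", ("Sci(Kiran)",), ("Cs(Kiran)",))
HAPPY = ("Happy(Kiran)", ("Crazy(Kiran)", "Math(Kiran)"), ())
CRAZY_PHI = ("Crazy(Kiran)", ("Phi(Kiran)",), ())
PHI = ("Phi(Kiran)", (), ())

P = {SCI, MATH, CRAZY_SCI, HAPPY, CRAZY_PHI}  # ground instance of the program
A = {HAPPY, PHI}                              # ground instance of the abducibles


def proves(program, atom):
    """Top-down proof with negation as finite failure (ground, loop-free only)."""
    return any(
        head == atom
        and all(proves(program, p) for p in pos)
        and not any(proves(program, n) for n in neg)
        for head, pos, neg in program
    )


def subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def explanations(program, abducibles):
    """All (E, F) that explain: not Happy(Kiran) <- Crazy(Kiran), not Cs(Kiran)."""
    found = []
    for E in map(frozenset, subsets(abducibles)):
        for F in map(frozenset, subsets(abducibles & program)):
            if E & F:
                continue
            revised = (program | E) - F
            if (
                proves(revised, "Crazy(Kiran)")
                and not proves(revised, "Cs(Kiran)")
                and not proves(revised, "Happy(Kiran)")
            ):
                found.append((E, F))
    return found


def minimal(found):
    """Keep explanations that have no other explanation component-wise below them."""
    return [
        (E, F)
        for E, F in found
        if not any(
            (E2, F2) != (E, F) and E2.issubset(E) and F2.issubset(F)
            for E2, F2 in found
        )
    ]


result = minimal(explanations(P, A))
```

Under these assumptions the search leaves a single minimal pair, which removes the rule for Happy and adds nothing, in agreement with explanation 1.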


4  Formal  Descriptions 

Based on the ideas presented previously, in this section we describe the formal
procedures for our rule based abduction. As described in section 3,
our approach is based on SLDNF resolution. In an SLDNF resolution proof,
negation is proved from a logic program by finite failure. However, it is well
known that an SLDNF tree may include an infinite branch. In this
case, no result can be proved from the SLDNF resolution proof. For instance,
given a logic program $\mathcal{P} = \{A \leftarrow not\,B.\;\; B \leftarrow not\,A\}$, no finite SLDNF tree exists
for $\mathcal{P} \cup \{\leftarrow A\}$. To avoid this problem, in this context, we assume that during
the abduction process each SLDNF tree is finite in the sense that each branch
in the SLDNF tree is finite.
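
The non-terminating case can be reproduced with a small experiment. The Python sketch below uses a naive ground prover with negation as failure and an explicit depth bound; the bound, the `DepthExceeded` exception and the rule encoding are assumptions of the sketch, introduced only to make the runaway computation observable.

```python
class DepthExceeded(Exception):
    """Raised when the proof attempt goes deeper than the allowed bound."""


def proves(program, atom, depth=0, limit=50):
    """Ground top-down prover with negation as failure and a depth bound."""
    if depth > limit:
        raise DepthExceeded(atom)
    for head, pos, neg in program:
        if head != atom:
            continue
        if all(proves(program, p, depth + 1, limit) for p in pos) and not any(
            proves(program, n, depth + 1, limit) for n in neg
        ):
            return True
    return False


# The program {A <- not B.  B <- not A} as (head, positive_body, negative_body).
looping = {("A", (), ("B",)), ("B", (), ("A",))}

try:
    proves(looping, "A")
    terminated = True
except DepthExceeded:
    terminated = False
```

Proving A requires the failure of B, which in turn requires the failure of A, so the attempt never bottoms out and the bound is hit. This is the situation excluded by the finiteness assumption.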

To simplify our following description, we also introduce some useful notions
about SLDNF trees. Observing an SLDNF tree described previously, e.g.
Fig. 3, if we use $n_0, \ldots, n_k, \ldots$ to denote all nodes in an SLDNF tree,
an SLDNF tree can actually be presented by a set of branches $n_0 \rightarrow n_1 \rightarrow \ldots$,
$\ldots$, $n_0 \rightarrow n_k \rightarrow \ldots$. Each branch in the SLDNF tree starts from the root node $n_0$.
Considering the SLDNF tree in Fig. 3, it is quite clear that it can be described as two
branches $n_0 \rightarrow n_1 \rightarrow n_2 \rightarrow \bigcirc$ and $n_0 \rightarrow n_3 \rightarrow$ Failed, where nodes $n_0, n_1, n_2, n_3$
are ← Crazy(Kiran), ← Sci(Kiran), not Cs(Kiran), ← not Cs(Kiran) and
← Phi(Kiran) respectively. Node $\bigcirc$ indicates the end of a successful branch,
and Failed indicates the end of a finitely failed branch. We also call a segment
of a branch starting from the root node a path in an SLDNF tree.

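Algorithm 1. Main($\langle \mathcal{P}, \mathcal{A} \rangle$, Q)

Function: Explain the observation Q in logic program $\langle \mathcal{P}, \mathcal{A} \rangle$

Input: A logic program (knowledge base) $\langle \mathcal{P}, \mathcal{A} \rangle$ and an observation Q, where
the body of Q consists of $P_1, \ldots, P_m, not\,P_{m+1}, \ldots, not\,P_n$, and Q is the head of the
observation

Output: $\langle \mathcal{E}, \mathcal{F} \rangle$, $\mathcal{P}'$

Begin

    $\mathcal{E} = \emptyset$; $\mathcal{F} = \emptyset$;

    For each i ($1 \le i \le m$) Do InsertHypothesis($\langle \mathcal{P}, \mathcal{A} \rangle$, $P_i$);

    For each i ($m + 1 \le i \le n$) Do DeleteHypothesis($\langle \mathcal{P}, \mathcal{A} \rangle$, $P_i$);

    If the observation is a positive observation
        Then InsertHypothesis($\langle \mathcal{P}, \mathcal{A} \rangle$, Q);

    If the observation is a negative observation
        Then DeleteHypothesis($\langle \mathcal{P}, \mathcal{A} \rangle$, Q);

    Print($\mathcal{P}'$, $\langle \mathcal{E}, \mathcal{F} \rangle$);

End.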

The function of algorithm Main($\langle \mathcal{P}, \mathcal{A} \rangle$, Q) is to split the observation,
which is a rule, into a set of literals, to derive the explanation for each literal, and
to update the program with the explanations, starting from the body part of the
observation. Once the body part is explained and the program is updated, the updated
program is used to explain the head. It should be noted that we do not restrict an observation
to be ground. For a non-ground observation, as we mentioned earlier, we need to
consider all its ground forms, which are obtained by substituting variables in the
observation with elements of the program universe $U(\mathcal{P})$. This consideration will be
adopted in our following algorithms.
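
The control flow of Main can be sketched in Python as follows. This is a sketch under loud assumptions: the function `revise` is a brute-force stand-in for both InsertHypothesis and DeleteHypothesis (it tries candidate pairs in order of size instead of walking an SLDNF tree), the program is ground and loop-free, and the rule encoding and all names are hypothetical.

```python
from itertools import chain, combinations


def proves(program, atom):
    """Top-down proof with negation as finite failure (ground, loop-free only)."""
    return any(
        head == atom
        and all(proves(program, p) for p in pos)
        and not any(proves(program, n) for n in neg)
        for head, pos, neg in program
    )


def subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def revise(program, abducibles, atom, want_derivable):
    """Stand-in for Insert/DeleteHypothesis: smallest (E, F) reaching the goal."""
    candidates = [
        (set(E), set(F))
        for E in subsets(abducibles)
        for F in subsets(abducibles & program)
        if not set(E) & set(F)
    ]
    candidates.sort(key=lambda pair: len(pair[0]) + len(pair[1]))
    for E, F in candidates:
        revised = (program | E) - F
        if proves(revised, atom) == want_derivable:
            return revised, E, F
    raise ValueError("no explanation found for " + atom)


def main(program, abducibles, head, head_is_positive, pos_body, neg_body):
    """Explain the body literals first, then the head, accumulating (E, F)."""
    E_total, F_total = set(), set()
    steps = [(a, True) for a in pos_body] + [(a, False) for a in neg_body]
    steps.append((head, head_is_positive))
    for atom, want_derivable in steps:
        program, E, F = revise(program, abducibles, atom, want_derivable)
        E_total |= E
        F_total |= F
    return program, E_total, F_total


# Ground instance of Example 2 from section 3.
HAPPY = ("Happy(Kiran)", ("Crazy(Kiran)", "Math(Kiran)"), ())
PHI = ("Phi(Kiran)", (), ())
P = {
    ("Sci(Kiran)", (), ()),
    ("Math(Kiran)", (), ()),
    ("Crazy(Kiran)", ("Sci(Kiran)",), ("Cs(Kiran)",)),
    HAPPY,
    ("Crazy(Kiran)", ("Phi(Kiran)",), ()),
}
A = {HAPPY, PHI}

new_program, E, F = main(P, A, "Happy(Kiran)", False, ["Crazy(Kiran)"], ["Cs(Kiran)"])
```

On this input the two body literals need no revision, and the head is explained by removing the rule for Happy, so the accumulated explanation has an empty E.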

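Algorithm 2. InsertHypothesis($\langle \mathcal{P}, \mathcal{A} \rangle$, Q)

Function: Explain the observation Q w.r.t. $\langle \mathcal{P}, \mathcal{A} \rangle$

Input: A logic program $\langle \mathcal{P}, \mathcal{A} \rangle$ and an observation Q
Output: An explanation $\langle \mathcal{E}, \mathcal{F} \rangle$ such that $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \models Q$

Begin

    Let $\mathcal{P}' = \mathcal{P}$ and Path = $\emptyset$;

    Loop

        T = a finite SLDNF tree for $\mathcal{P}' \cup \{\leftarrow Q\}$;

        If there is a successful branch
        $n_0 \rightarrow n_1 \rightarrow \ldots \rightarrow n_i \rightarrow \bigcirc$ in T and $n_i \in \mathcal{A}$

            then $\mathcal{E}_n = n_i$, $\mathcal{E} = \mathcal{E} \cup \mathcal{E}_n$, $\mathcal{P}' = \mathcal{P}' \cup \mathcal{E}$;

                Return ($\langle \mathcal{E}, \mathcal{F} \rangle$, $\mathcal{P}'$);

        Select a finitely failed branch B in T such that Path $\subseteq$ B:
        $n_0 \rightarrow \ldots \rightarrow n_i \rightarrow$ Failed;

        (1) If $n_i$ has form $\leftarrow P_{i1}, [not]P_{i2}, \ldots$

            then If $P_{i1} \subseteq \mathcal{A}$ then $\mathcal{E}_n = P_{i1}$, $\mathcal{E} = \mathcal{E} \cup \mathcal{E}_n$, $\mathcal{P}' = (\mathcal{P}' \cup \mathcal{E})$;

                Return ($\langle \mathcal{E}, \mathcal{F} \rangle$, $\mathcal{P}'$);

        (2) If $n_i$ has form $\leftarrow not\,P_{i1}, [not]P_{i2}, \ldots$,

            $\mathcal{P}'$ = DeleteHypothesis($\mathcal{P}'$, $not\,P_{i1} \leftarrow$);

    EndLoop

End.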

A detailed explanation of the above algorithm is given in our full paper
due to the space limit of this paper. Here it is sufficient to highlight some key
ideas embedded in this algorithm.

The algorithm Main splits the observation into a set of literals and passes
them to the algorithm InsertHypothesis. If there is a successful branch in
the SLDNF tree, then the process stops. Otherwise, we need to add or remove
some rules from $\mathcal{P}$ so that the goal ← Q can be achieved. It is worth noting that
if a subgoal not P fails in some branch of the SLDNF tree, we need to call
algorithm DeleteHypothesis($\mathcal{P}'$, not P ←), which will be described below, in
order to achieve the subgoal ← not P. Finally, $\mathcal{P}$ is changed to $\mathcal{P}'$ such that
$\mathcal{P}' \vdash Q$. First, the body of the observation is explained using this algorithm by the
SLDNF tree for $\mathcal{P} \cup \{\leftarrow Q\}$, where Q is an atom from the body of the observation.
Once the body of the observation is explained completely, a new program
$\mathcal{P}' = (\mathcal{P} \cup \mathcal{E}) - \mathcal{F}$ is formed to explain the head of the observation; thereafter,
from the SLDNF tree for $\mathcal{P}' \cup \{\leftarrow Q\}$, where Q is the head of the observation, we
get another set of explanations, which combined with the previous ones gives the explanation for the
observation.

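Algorithm 3. DeleteHypothesis($\langle \mathcal{P}, \mathcal{A} \rangle$, G)

Function: Explain the observation G w.r.t. $\langle \mathcal{P}, \mathcal{A} \rangle$

Input: A logic program $\langle \mathcal{P}, \mathcal{A} \rangle$ and an observation G
Output: An explanation $\langle \mathcal{E}, \mathcal{F} \rangle$ such that $(\mathcal{P} \cup \mathcal{E}) - \mathcal{F} \nvdash G$

Begin

    T = a finite SLDNF tree for $\mathcal{P}' \cup \{\leftarrow G\}$;

    Loop

        Let T = a finite SLDNF tree for $\mathcal{P}' \cup \{\leftarrow G\}$;

        For each successful branch B in T,

        (1) If there exist two nodes $n_i$ and $n_{i+1}$ in B such that $n_i \rightarrow n_{i+1}$,
            where $n_i$ and $n_{i+1}$ have forms $\leftarrow P_{i1}, [not]P_{i2}, \ldots$ and
            $\leftarrow [not]P_{i2}, \ldots$ respectively, then $\mathcal{F}_n = P_{i1}$;

            If $\mathcal{F}_n \subseteq \mathcal{A} \cap \mathcal{P}$ Then $\mathcal{F} = \mathcal{F} \cup \mathcal{F}_n$, $\mathcal{P}' = \mathcal{P}' - \mathcal{F}$;
                Return ($\mathcal{P}'$, $\langle \mathcal{E}, \mathcal{F} \rangle$);

        (2) If there exist two nodes $n_i$ and $n_{i+1}$ in B such that $n_i \rightarrow n_{i+1}$,
            where $n_i$ and $n_{i+1}$ have forms $\leftarrow not\,P_{i1}, [not]P_{i2}, \ldots$ and
            $\leftarrow [not]P_{i2}, \ldots$ respectively, then $\mathcal{E}_n = P_{i1}$;

            If $\mathcal{E}_n \subseteq \mathcal{A}$ Then $\mathcal{E} = \mathcal{E} \cup \mathcal{E}_n$, $\mathcal{P}' = \mathcal{P}' \cup \mathcal{E}$;
                Return ($\mathcal{P}'$, $\langle \mathcal{E}, \mathcal{F} \rangle$);

        (3) If there exist two nodes $n_i$ and $n_{i+1}$ in B such that $n_i \rightarrow n_{i+1}$,
            where $n_i$ and $n_{i+1}$ have forms $\leftarrow P_{i1}, [not]P_{i2}, \ldots$ and
            $\leftarrow [not]P_{i11}, \ldots, [not]P_{i1q}, [not]P_{i2}, \ldots$
            respectively, in which a rule r: $P_{i1} \leftarrow [not]P_{i11}, \ldots, [not]P_{i1q}$
            is used to derive node $n_{i+1}$, then $\mathcal{F}_n = r$;

            If $\mathcal{F}_n \subseteq \mathcal{A} \cap \mathcal{P}$ Then $\mathcal{F} = \mathcal{F} \cup \mathcal{F}_n$, $\mathcal{P}' = \mathcal{P}' - \mathcal{F}$;
                Return ($\mathcal{P}'$, $\langle \mathcal{E}, \mathcal{F} \rangle$);

        T = a finite SLDNF tree for $\mathcal{P}' \cup \{\leftarrow G\}$;

    EndLoop

End.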


Theorem 1. Given an abductive logic program $\langle \mathcal{P}, \mathcal{A} \rangle$ and an observation
rule Q, if there exists an explanation for Q, then the algorithm Main($\langle \mathcal{P}, \mathcal{A} \rangle$, Q)
will always return an explanation $(\mathcal{E}, \mathcal{F})$ for Q, and this explanation is minimal.

This theorem ensures that our procedure finds only a minimal explanation for
an observation with respect to an abductive logic program whenever some
explanation for this observation exists.

5  Concluding  Remarks 

In this paper, we propose an SLDNF proof approach for abductive reasoning.
Unlike previous approaches, our approach allows the observation to
be a rule. Since, within our framework, abductive reasoning is translated
into an SLDNF proof, our approach can be easily implemented based on a
revision of the traditional SLDNF proof procedure. Finally, we should mention that
a system implementation based on the framework proposed in this paper is
currently being undertaken.


Abstract. Enlarging the class of tractable SAT problems is a relevant
topic because of its repercussions on both practical applications and theoretical
grounds. In this paper, it is proved that some non-clausal Horn-like
formulas can be solved in linear time. In addition to its theoretical
importance, this result has a special practical interest because Knowledge
Based Systems could benefit from it due to the Horn-like structure
of the formulas. In order to prove such linearity, a correct refutational
calculus is first provided and, second, a linear algorithm is described.


1  Introduction 

Representing knowledge and reasoning with non-clausal formulas is a matter
of high importance in Artificial Intelligence and more generally in Computer
Science. However, most of the existing methods have been designed for clausal
reasoning. Thus, the two main methods to solve problems in non-clausal formulas
transform the original non-clausal formula into a clausal formula. Both transformations
have severe drawbacks because either the number of propositions of the
transformed formula increases exponentially as a consequence of the $\vee/\wedge$ distribution,
or a certain number of artificial literals are introduced in the transformed
formula, losing the logical equivalence relation, which could be invalid for certain
applications.

In this paper, we identify Negation Normal Form (NNF) formulas $F_1 \wedge F_2 \wedge
\ldots \wedge F_n$ having a Horn-like structure. Each $F_i$ is a disjunction of two optional
terms, i.e. $F_i = NNF^- \vee C^+$: the first one is an NNF formula with only negated
propositions, noted $NNF^-$, and the second one is a conjunction of non-negated
atomic propositions, noted $C^+$.

The kind of formulas we deal with in this paper can arise from an original
non-clausal representation of the problem. They are compact representations of
Horn formulas given that they require fewer symbols than Horn formulas to codify
identical problems; this reduction can be exponential.

This paper is structured as follows. In the next section we briefly review the
research about non-clausal tractable reasoning and related issues. After
formally defining the mentioned formulas, we define a sound and
refutationally complete logical calculus and finally, we detail an algorithm in
pseudo-code with suitable data structures to solve SAT problems expressed in this
kind of formulas. This algorithm is shown to be sound and refutationally
complete, and it runs in strictly linear time.

Z.W. Ras and S. Ohsuga (Eds.): ISMIS 2000, LNAI 1932, pp. 534-542, 2000.

2  Related  Work 

Several  methods  have  been  developed  to  infer  with  non-clausal  formulas.  This 
is  the  case  of  Matings  [2],  Matrix  Connection  [3],  Dissolution  [9],  and  TAS  [7]. 
However,  no  studies  relative  to  NNF  tractability  employing  one  of  these  methods 
have  been  carried  out. 

As far as we know, the first published results concerning non-clausal tractability
come from [4-6], where a strictly linear forward chaining algorithm to test
for the satisfiability of a certain subclass of NNF formulas is detailed. Such a class
embeds the Horn case as a particular case. In [8] a linear backward algorithm is
given for the same NNF subclass of formulas. Following this research line, very
recently a preliminary version of this article has been presented in [1].

New results concerning NNF tractability are reported in [11], where a method
called Restricted Fact Propagation is presented, which is a quadratic, incomplete
non-clausal inference procedure. More recently, in [12, 13] a significant advance
has been accomplished. The author defines a class of formulas by extending Horn
formulas to the field of NNF formulas. Such an extension relies on the concept
of polarity. A method to make inferences and potentially to detect refutational
formulas is designed. In [12], an SLD-resolution variant with the property of being
refutationally complete is shown, but its computational complexity is not
studied. In [13] a method for propositional NNF Horn-like formulas is described
and it is stated that the method is sound, incomplete and linear.

However, concerning the last issue, no algorithm is specified; indeed the steps
of the method are described as different propagations of some truth values in
a sparse tree. Thus, although it seems that the number of inferences of the
proposed method is linear, the resulting complexity (w.r.t. the
number of computer instructions) of a linear number of truth value propagations
on the employed sparse trees is not proved.

3  Our  formulas 

Firstly, we introduce our class of non-normal formulas.

Definition 1. A literal L is either an atomic proposition $p \in P$, noted $L^+$, or
its negation $\neg p$, noted $L^-$.

Notation. From now on, D stands for a disjunction of literals $(L_1 \vee \ldots \vee L_k)$
and C denotes a conjunction of literals $(L_1 \wedge \ldots \wedge L_k)$. $D^+$ and $C^+$ ($D^-$ and
$C^-$) include only positive (negative) literals.

Definition 2. A CNF formula is a conjunction of disjunctions of literals $(D_1 \wedge
\ldots \wedge D_m)$ and a DNF formula is a disjunction of conjunctions of literals $(C_1 \vee \ldots \vee C_m)$.
Also, $CNF^+$ and $DNF^+$ (resp. $CNF^-$ and $DNF^-$) include only positive
(resp. negative) literals.

Definition 3. A clause CF is a disjunction of three optional terms $CF =
DNF^- \vee CNF^- \vee C^+$. Clauses with only the $DNF^- \vee CNF^-$ (resp. $C^+$) term
are said to be negative (resp. positive) clauses. We denote the empty clause by $\square$.

A formula F is a finite conjunction of clauses CF. We note $F_\square$ any formula
containing the empty clause $\square$.

Example 1. An example of the kind of the defined formulas is $F = \{(p_1), (p_3),
(p_6), (((\neg p_1 \wedge \neg p_2) \vee \neg p_3) \vee ((\neg p_4 \vee \neg p_5) \wedge \neg p_6) \vee (p_7 \wedge p_8)), (\neg p_8)\}$. For the non-unitary
clause we have $DNF^- = ((\neg p_1 \wedge \neg p_2) \vee \neg p_3)$ and $CNF^- = ((\neg p_4 \vee \neg p_5) \wedge
\neg p_6)$. This non-unitary clause is equivalent to eight clauses; for instance, two of
them are $(\neg p_1 \vee \neg p_3 \vee \neg p_4 \vee \neg p_5 \vee p_7)$ and $(\neg p_2 \vee \neg p_3 \vee \neg p_6 \vee p_8)$.
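
The count of eight clauses can be verified by distributing the three terms of the non-unitary clause and comparing truth tables. The Python sketch below does exactly that; the list-based encoding, in which atoms inside `dnf_neg` and `cnf_neg` are implicitly negated, and the function names are assumptions made for illustration.

```python
from itertools import product

# The non-unitary clause of Example 1; atoms in dnf_neg and cnf_neg are negated.
dnf_neg = [["p1", "p2"], ["p3"]]   # (not p1 and not p2) or (not p3)
cnf_neg = [["p4", "p5"], ["p6"]]   # (not p4 or not p5) and (not p6)
c_pos = ["p7", "p8"]               # p7 and p8


def expand(dnf_neg, cnf_neg, c_pos):
    """Distribute DNF- v CNF- v C+ into (negated atoms, positive atom) clauses."""
    clauses = []
    for picks in product(*dnf_neg):        # one literal from each conjunction
        for disjunction in cnf_neg:        # one disjunction of the CNF- term
            for atom in c_pos:             # one atom of the C+ term
                clauses.append((frozenset(picks) | frozenset(disjunction), atom))
    return clauses


def compact_value(v):
    """Truth value of the compact clause under the assignment v."""
    return (
        any(all(not v[p] for p in conj) for conj in dnf_neg)
        or all(any(not v[p] for p in disj) for disj in cnf_neg)
        or all(v[p] for p in c_pos)
    )


def expanded_value(v, clauses):
    """Truth value of the conjunction of the expanded clauses under v."""
    return all(any(not v[p] for p in negs) or v[atom] for negs, atom in clauses)


clauses = expand(dnf_neg, cnf_neg, c_pos)
atoms = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
equivalent = all(
    compact_value(dict(zip(atoms, bits)))
    == expanded_value(dict(zip(atoms, bits)), clauses)
    for bits in product([False, True], repeat=len(atoms))
)
```

The expansion has 2 × 2 × 2 = 8 members, one per choice from each of the three terms, and it contains the two clauses quoted in the example.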

Definition 4. An interpretation I assigns to each formula F one value in the
set {0, 1} and it satisfies:

- A literal $p$ ($\neg p$) iff $I(p) = 1$ ($I(p) = 0$).

- A disjunction $D = L_1 \vee \ldots \vee L_k$ iff $I(L_i) = 1$ for at least one $L_i$.

- A conjunction $C = L_1 \wedge \ldots \wedge L_k$ iff $I(L_i) = 1$ for every $L_i$.

- A clause $CF = DNF^- \vee CNF^- \vee C^+$ iff $I(DNF^-) = 1$ or $I(CNF^-) = 1$
or $I(C^+) = 1$.

- A formula F iff I satisfies all clauses CF of the formula.

An interpretation I is a model of a formula F if it satisfies the formula. We say
that F is satisfiable if it has at least one model; otherwise, it is unsatisfiable.
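Definition 4 translates directly into a small evaluator. The sketch below is only an illustration: the clause encoding `(dnf, cnf, pos)` with `None` for an absent term, the function names, and the exhaustive search are assumptions, not part of the method described in this paper. The search is exponential and meant for tiny formulas only.

```python
from itertools import product


def satisfies_clause(interp, clause):
    """Definition 4 for one clause (dnf, cnf, pos); None marks an absent term."""
    dnf, cnf, pos = clause
    # DNF-: some conjunction has all of its (negated) propositions false.
    dnf_true = dnf is not None and any(
        all(not interp[p] for p in conj) for conj in dnf)
    # CNF-: every disjunction has at least one (negated) proposition false.
    cnf_true = cnf is not None and all(
        any(not interp[p] for p in disj) for disj in cnf)
    # C+: every proposition of the positive conjunction is true.
    pos_true = pos is not None and all(interp[p] for p in pos)
    return dnf_true or cnf_true or pos_true


def is_model(interp, formula):
    """An interpretation is a model iff it satisfies every clause."""
    return all(satisfies_clause(interp, cf) for cf in formula)


def is_satisfiable(formula, variables):
    """Try every interpretation of the given variables."""
    return any(
        is_model(dict(zip(variables, values)), formula)
        for values in product([False, True], repeat=len(variables)))


# The formula F of Example 1.
EXAMPLE_1 = [
    (None, None, [1]), (None, None, [3]), (None, None, [6]),
    ([[1, 2], [3]], [[4, 5], [6]], [7, 8]),
    ([[8]], None, None),            # the negative unit clause (~p8)
]
print(is_satisfiable(EXAMPLE_1, range(1, 9)))       # False
print(is_satisfiable(EXAMPLE_1[:-1], range(1, 9)))  # True: (~p8) dropped
```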

Definition 5. Variable Truth Assignment (VTA). If $(p) \in F$ then VTA
derives a new formula F' that results from removing from F the unit clause (p), the
conjunctions $(\neg p \wedge \neg p_1 \wedge \ldots \wedge \neg p_k)$ and $(\neg p \wedge D_1^- \wedge \ldots \wedge D_m^-)$, and all the
remaining occurrences of $\neg p$.

Definition 6. And-Elimination (AE). Inference rule AE derives from a positive
conjunction clause $(p_1 \wedge \ldots \wedge p_i \wedge \ldots \wedge p_n)$ the unit clauses $(p_1), \ldots, (p_i), \ldots, (p_n)$.

Definition 7. Clause Proof. A refutation of a formula F is a finite sequence
of formulas $\langle F_1, F_2, \ldots, F_n \rangle$ such that $F_1 = F$, $F_n = F_\square$ and, for each
$1 \le i \le n-1$, either $F_{i+1} = VTA(F_i)$ or $F_{i+1} = AE(F_i)$.

Example 2. A proof of the unsatisfiability of F in the previous example is:

$\{(p_1), (p_3), (p_6), (((\neg p_1 \wedge \neg p_2) \vee \neg p_3) \vee ((\neg p_4 \vee \neg p_5) \wedge \neg p_6) \vee (p_7 \wedge p_8)), (\neg p_8)\}$
$\vdash_{VTA} \{(p_3), (p_6), (\neg p_3 \vee ((\neg p_4 \vee \neg p_5) \wedge \neg p_6) \vee (p_7 \wedge p_8)), (\neg p_8)\}$
$\vdash_{VTA} \{(p_6), (((\neg p_4 \vee \neg p_5) \wedge \neg p_6) \vee (p_7 \wedge p_8)), (\neg p_8)\}$
$\vdash_{VTA} \{(p_7 \wedge p_8), (\neg p_8)\}$
$\vdash_{AE} \{(p_7), (p_8), (\neg p_8)\}$
$\vdash_{VTA} \{(p_7), \square\} = F_\square$

Theorem 1. Soundness. $F \vdash_{(VTA+AE)} F' \Rightarrow F \models F'$.

The proofs of the soundness of each rule are trivial, and the proof of the
theorem follows straightforwardly from those proofs.

Theorem 2. Refutational Completeness. Let F be an unsatisfiable formula;
then $F \vdash_{(VTA+AE)} F_\square$.

The proof is by induction on the length of F, i.e. the number of occurrences
of literals in F. The following theorem extends completeness to atomic clauses.

Theorem 3. Literal Completeness. $F \models (L) \Rightarrow F \vdash_{(VTA+AE)} (L)$.

Comparing VTA/AE proofs with known automated deduction approaches,
for instance Analytic Tableau, Path Dissolution, Matrix Connection, etc., is
beyond the scope of this article. For a discussion of the complexity of Tableau
proofs and resolution approaches, the reader is referred to [10,14].

4  Algorithm  Description 

The principle of the algorithm is to apply the VTA and AE inference
rules until one of the following occurs: (1) the empty clause is derived, or
(2) no more new unit clauses can be derived. In case (1) the original formula
is detected to be unsatisfiable, whereas in case (2) the formula is
satisfiable.

Brief description of the algorithm. We roughly describe the steps of the
algorithm. Initially, it tests whether F has positive unit clauses. If it has none,
it returns “satisfiable” because all the clauses have at least one negative
literal. So, assume that $(p) \in F$. Then F is satisfiable iff $F\{p \leftarrow 1\}$ is satisfiable.
In other words, F is satisfiable iff the formula F' that results from removing from F the
unit clause (p), the conjunctions $(\neg p \wedge \neg p_1 \wedge \ldots \wedge \neg p_k)$ and $(\neg p \wedge D_1^- \wedge \ldots \wedge D_m^-)$,
and all the remaining occurrences of the literal $\neg p$ is satisfiable. Thus, whenever
$(p) \in F$ the algorithm removes the mentioned elements from F, i.e. it applies
the VTA inference rule. The algorithm performs this operation for each
positive unit clause in F. Now, observe that some clauses in the initial formula
can become positive ($CF = C^+$) because of the removal of some negative parts.
Also due to these removals, a purely negative clause ($CF = DNF^- \vee CNF^-$) can
become empty. Thus, at this point, three algorithmic states can arise:

1) An empty clause is produced. The algorithm ends by determining that the
original formula is unsatisfiable.
2) No positive clauses have emerged. The algorithm ends by determining that
the formula is satisfiable.
3) A positive clause is produced. Then the algorithm applies the And-Elimination
rule and adds new unit clauses to the formula. A new iteration of the operations
described above is then carried out with these new unit clauses.

We begin the description of the algorithm with a very simple recursive version
to help the reader understand it: each inference rule is implemented
by one procedure. Procedures VTA and AE must perform the following
operations, according to the definitions of their corresponding inference rules:

(VTA F p): It applies the VTA rule, returning the formula F' that results from
removing from F the clause (p), the conjunctions $(\neg p \wedge \neg p_1 \wedge \ldots \wedge \neg p_k)$ and
$(\neg p \wedge D_1^- \wedge \ldots \wedge D_m^-)$, and all the remaining occurrences of $\neg p$.

(AE F C+): It applies the And-Elimination rule, returning the formula F' that
results from removing $C^+$ from F and adding the unit clause (p) for each
conjunct p in $C^+$.

We denote by $F^+$ the set of positive clauses in F; this set can be empty. The main
procedure is given by the following simple code.

VTA-AE-Propagation(F)

    If □ ∈ F then return(unsat)

    If F+ = {} then return(sat)

    If (p) ∈ F then return(VTA-AE-Propagation(VTA F p))

    If C+ ∈ F then return(VTA-AE-Propagation(AE F C+))

Theorem 4. This algorithm returns sat iff the input formula F is satisfiable.

Theorem 5. The number of recursive calls is in O(size(F)).

The previous algorithm is correct but not very efficient. Its efficiency is similar
to that of the methods proposed in [11-13]. Although the number of recursive calls is
bounded by O(n), the cost of each line is clearly not constant, and so the
algorithm's complexity measured in number of computer instructions is not linear.
One can check that searching for the clauses that include some occurrence of $\neg p$
without a suitable data structure has O(size(F)) computational cost. Hence, the
real complexity of the algorithm in number of computer instructions can be
quadratic, i.e. in $O(n^2)$.
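The simple recursive procedure can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the authors' implementation: a clause is encoded as a tuple `(dnf, cnf, pos)` of non-empty lists of proposition indices (indices in `dnf` and `cnf` are implicitly negated), `None` marks an absent term, and the function names are hypothetical. The sketch tests for the empty clause before testing whether positive clauses remain.

```python
def vta(formula, p):
    """VTA: drop (p), the negative conjunctions containing ~p, and every ~p."""
    result = []
    for dnf, cnf, pos in formula:
        if dnf is None and cnf is None and pos == [p]:
            continue                                 # the unit clause (p)
        if dnf is not None:
            # A conjunction containing ~p is false; an emptied DNF- vanishes.
            dnf = [conj for conj in dnf if p not in conj] or None
        if cnf is not None:
            cnf = [[q for q in disj if q != p] for disj in cnf]
            if any(not disj for disj in cnf):        # one disjunction is false,
                cnf = None                           # so the CNF- term is too
        result.append((dnf, cnf, pos))
    return result


def ae(formula, clause):
    """AE: replace the positive clause C+ by one unit clause per conjunct."""
    rest = [cf for cf in formula if cf is not clause]
    return rest + [(None, None, [p]) for p in clause[2]]


def vta_ae_propagation(formula):
    """Return "sat" or "unsat" by repeated application of VTA and AE."""
    if any(cf == (None, None, None) for cf in formula):
        return "unsat"                               # empty clause present
    positives = [cf for cf in formula if cf[0] is None and cf[1] is None]
    if not positives:
        return "sat"
    units = [cf[2][0] for cf in positives if len(cf[2]) == 1]
    if units:
        return vta_ae_propagation(vta(formula, units[0]))
    return vta_ae_propagation(ae(formula, positives[0]))


# The formula F of Example 1.
EXAMPLE_1 = [
    (None, None, [1]), (None, None, [3]), (None, None, [6]),
    ([[1, 2], [3]], [[4, 5], [6]], [7, 8]),
    ([[8]], None, None),
]
print(vta_ae_propagation(EXAMPLE_1))  # unsat
```

Each call to `vta` scans the whole formula, which is exactly the per-step linear cost discussed above.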

Optimal algorithm description. Next, we introduce suitable
data structures to bound the worst-case complexity. We do so progressively
to make the whole algorithm easier to understand, since its
complete definition in pseudo-code contains many details. We first
discuss the VTA operations relative to the $CNF^-$ term, then those relative to the
$DNF^-$ term, and finally both together. So, the structure of (VTA F p) is as follows:

(VTA F p)

    (VTA-CNF p)

    (VTA-DNF p)

    (VTA-DNF-CNF p)

The function in the last line potentially returns conjunctions $C^+$ if both negative
terms $CNF^-$ and $DNF^-$ of their respective clauses have been falsified.
Now let us focus on each of these three parts of VTA.

$CNF^-$ processing. A $CNF^- = D_1^- \wedge \ldots \wedge D_m^-$ is falsified when all the
propositions in the negative literals of at least one disjunction $D_j^-$ in the $CNF^-$
are derived. According to the VTA inference rule, for each $(p) \in F$ the negative
occurrences $\neg p$ must be removed. To perform this step, we would have to search for the
$\neg p$ occurrences and remove them from the $CNF^-$ term, but this search has an
O(n) overhead. To make this cost constant we use a counter Neg.Counter($D^-$)
associated with each disjunction $D^- \in CNF^-$. The physical removal is thus
replaced by a decrement operation. Each decrement, done in O(1) time, is
equivalent to one removal. In this way, the counter associated with a disjunction
indicates the number of literals whose proposition has not been deduced yet.
Whenever a counter reaches zero, a flag flag(CF, CNF) is turned on; this flag
is used in the joint processing of the $CNF^-$ and $DNF^-$ terms.

Therefore, to handle the $CNF^-$ term we require two data structures:
Neg.D(p): set of pointers to pairs ($D^-$, CF) such that $\neg p \in D^- \in CF$; and
Neg.Counter($D^-$): counter of the negated propositions $\neg p$ in the disjunction $D^-$
such that p has not been deduced yet.

Notation. Henceforth, [X] denotes a pointer to the object X. For example, [CF]
is a pointer to the clause CF.

The algorithmic application of VTA over the $CNF^-$ term is as follows:

VTA-CNF (p)

    Remove (p) from F

    for each ([D^-], [CF]) ∈ Neg.D(p) do:

        Decrement Neg.Counter(D^-)

        if Neg.Counter(D^-) = 0 then flag(CF, CNF) ← 1

$DNF^-$ processing. The processing of the $DNF^-$ term by VTA when a clause
(p) is deduced is also based on the adequate handling of counters. However,
these counters operate differently than in the $CNF^-$ case: a
unique counter is associated with each $DNF^-$. Each decrement must correspond to one and
only one conjunction $C^- \in DNF^-$. So, the counter reaches zero when at least
one proposition p has been deduced for each conjunction, each of these propositions
falsifying its conjunction $C^- \in DNF^-$ because $\neg p \in C^-$.

Notice, however, that further deduced propositions whose complements belong
to the same negative conjunction must not provoke decrements of the counter:
only one decrement per conjunction is allowed. Otherwise, the
counter could reach 0 without the whole $DNF^-$ having been falsified. That
could happen if $DNF^- = C_1^- \vee C_2^- \vee \ldots \vee C_i^- \vee \ldots \vee C_k^-$ and the propositions
deduced are not distributed over all the k negative conjunctions.

To ensure that the counter is decremented only once per negative conjunction,
we require a flag First($C_i^-$) which indicates whether any proposition
whose negation is in $C_i^-$ has already been derived. Thus,
First($C_i^-$) = 0 (false) if no proposition in $C_i^-$ has been
deduced; once the first proposition is deduced, First($C_i^-$) is set to 1 (true).

Similarly to the $CNF^-$ case, if the $DNF^-$ part of a clause is falsified then a flag
flag(CF, DNF) is turned on. With the two new data structures, the algorithm
corresponding to the execution of VTA over the $DNF^-$ term is the following:

VTA-DNF (p)

    for each ([C^-], [CF]) ∈ Neg.C(p) do:
        if First(C^-) = 0 then do:

            Decrement Neg.Counter(CF)

            First(C^-) ← 1

            if Neg.Counter(CF) = 0 then flag(CF, DNF) ← 1

Joint processing of $CNF^-$ and $DNF^-$.

The two algorithmic operations above are performed independently; we now need
to combine their effects by considering the state of both flags,
flag(CF, CNF) and flag(CF, DNF). If both flags are turned on, the
disjunction $CNF^- \vee DNF^-$ of CF has been falsified and the conjunction
$C^+$ of CF is deduced, since $CF = CNF^- \vee DNF^- \vee C^+$. If $C^+$ is the
empty conjunction, the initial formula is unsatisfiable; otherwise the procedure
AE is launched. The algorithmic steps are:

VTA-DNF-CNF (F p)

    for each [CF] ∈ Neg.C(p) ∪ Neg.D(p) do:

        if flag(CF, CNF) = flag(CF, DNF) = 1 then:
            if C^+ = □ then return (unsat)

            else C^+ = (p_1 ∧ ... ∧ p_n)

AE algorithm. We now describe the application of the AE($C^+$) procedure,
i.e. the And-Elimination inference rule. We first observe that the same
proposition p can be deduced from more than one conjunction $C^+$, and then the
counters could be decremented more than once. To prevent these multiple decrements,
we use a boolean variable as follows: Val(p) = 1 iff p has already been
derived from F. So, the truth propagation of variable p is allowed only when
the flag Val(p) is set to 0, and once the propagation has been performed, the
flag is set to 1, preventing further propagations. Also, a list $C^+$ of the non-negated
propositions in CF is required. Thus, the procedure AE becomes:

AE (C^+)
    for each p ∈ C^+ do:
        if Val(p) = 0 then do:

            Add (p) to F
            Val(p) ← 1

Complete algorithm. Finally, we present the definitive version of the whole
algorithm. For lack of space we omit the procedure that initialises all the data
structures employed by the algorithm. To speed up the search for unit clauses (p) and
conjunctions ($C^+$), these are placed, when they are deduced, into two respective
stacks, stack.p and stack.C, so that they need not be searched for inside
the formula. The complete algorithm is given below.

while stack.p ≠ {} do:

    p ← pull(stack.p)

    ; VTA-CNF (p)
    for each ([D^-], [CF]) ∈ Neg.D(p) do:
        Decrement Neg.Counter(D^-)
        if Neg.Counter(D^-) = 0 then flag(CF, CNF) ← 1

    ; VTA-DNF (p)
    for each ([C^-], [CF]) ∈ Neg.C(p) do:
        if First(C^-) = 0 then do:
            Decrement Neg.Counter(CF)
            First(C^-) ← 1
            if Neg.Counter(CF) = 0 then flag(CF, DNF) ← 1

    ; VTA-DNF-CNF (p)
    for each [CF] ∈ Neg.D(p) ∪ Neg.C(p) do:
        if flag(CF, CNF) = flag(CF, DNF) = 1 then:
            if C^+ = □ then return (unsat)
            else push(C^+, stack.C)

    ; AE (C^+)
    while stack.C ≠ {} do:
        C^+ ← pull(stack.C)
        for each p ∈ C^+ do:
            if Val(p) = 0 then do:
                push(p, stack.p)
                Val(p) ← 1

return (sat)

Theorem 6. Correctness. The algorithm described above returns
“unsat” iff F is unsatisfiable.

The proof follows from the correctness of the logical calculus, given that the
algorithm is a step-by-step implementation of the inference rules. Each
of the operations performed by the algorithm has its counterpart in the pure
inference process described above.

Theorem 7. The complexity of this algorithm is strictly in O(size(F)).

The proof of this theorem rests on the fact that each proposition
is pushed at most once onto stack.p and each conjunction $C^+$ at most once onto
stack.C. Hence, the total cost of the “for” loops is bounded by the number of
occurrences of literals in the initial formula.
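A compact Python sketch of the counter-based procedure is given here for illustration. Several points are assumptions, since the initialisation procedure is omitted from the text: clauses are tuples `(dnf, cnf, pos)` of non-empty lists of proposition indices with `None` for an absent term, an absent negative term is treated as already falsified, initially positive clauses are pushed during initialisation, and a per-clause guard `fired` ensures that each $C^+$ is pushed only once. The stack operation called pull in the pseudo-code corresponds to `pop` here. All identifiers are hypothetical.

```python
from collections import defaultdict


def horn_like_sat(formula):
    """Counter-based propagation for clauses (dnf, cnf, pos); None = absent term."""
    neg_d = defaultdict(list)   # p -> [(disjunction id, clause id)]
    neg_c = defaultdict(list)   # p -> [(conjunction id, clause id)]
    cnt_d, first = [], []       # per disjunction / per negative conjunction
    cnt_dnf, flag_cnf, flag_dnf, fired = [], [], [], []   # per clause
    val = set()                 # propositions already derived
    stack_p, stack_c = [], []

    for i, (dnf, cnf, pos) in enumerate(formula):
        for conj in dnf or []:
            for q in conj:
                neg_c[q].append((len(first), i))
            first.append(False)
        cnt_dnf.append(len(dnf or []))
        flag_dnf.append(not dnf)            # absent DNF- counts as falsified
        for disj in cnf or []:
            for q in disj:
                neg_d[q].append((len(cnt_d), i))
            cnt_d.append(len(disj))
        flag_cnf.append(cnf is None)        # absent CNF- counts as falsified
        fired.append(False)

    def deduce(i):
        """Negative terms of clause i are falsified; False means empty clause."""
        if not fired[i]:
            fired[i] = True
            if not formula[i][2]:
                return False
            stack_c.append(formula[i][2])
        return True

    for i in range(len(formula)):           # clauses that are positive at the start
        if flag_cnf[i] and flag_dnf[i] and not deduce(i):
            return "unsat"

    while True:
        while stack_c:                      # AE
            for p in stack_c.pop():
                if p not in val:
                    val.add(p)
                    stack_p.append(p)
        if not stack_p:
            return "sat"
        p = stack_p.pop()
        touched = []
        for d, i in neg_d[p]:               # VTA-CNF
            cnt_d[d] -= 1
            if cnt_d[d] == 0:
                flag_cnf[i] = True
            touched.append(i)
        for c, i in neg_c[p]:               # VTA-DNF
            if not first[c]:
                first[c] = True
                cnt_dnf[i] -= 1
                if cnt_dnf[i] == 0:
                    flag_dnf[i] = True
            touched.append(i)
        for i in touched:                   # VTA-DNF-CNF
            if flag_cnf[i] and flag_dnf[i] and not deduce(i):
                return "unsat"
```

On the formula of Example 1, encoded as `[(None, None, [1]), (None, None, [3]), (None, None, [6]), ([[1, 2], [3]], [[4, 5], [6]], [7, 8]), ([[8]], None, None)]`, the function returns `"unsat"`; without the last clause it returns `"sat"`.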

5  Conclusions 

We have defined a new class of non-clausal formulas having a Horn-like shape.
This class includes Horn formulas as a particular case. Secondly, we have presented
a calculus and shown its soundness and completeness. Finally, we
have designed a strictly linear algorithm to solve the SAT problem in this class.
Our formulas are of interest in many applications, for instance those
based on Rule Based Systems, which can benefit from the use of a richer language
than classical Horn formalisms. The proposed formulas represent logically
equivalent pure Horn problems but with exponentially fewer symbols. Hence, as
the described algorithm runs in linear time, the gain in time can be of
exponential order with respect to the known linear algorithms running on the Horn
formulas.


An Intelligent System Dealing with Complex Nuanced Information

Abstract. The main objective of this paper is to propose an intelligent
system dealing with affirmative or negative information. We do not refer
to a logical negation but to a linguistic one. Moreover, not only atomic
but also complex nuances can be denied. Among the intended meanings
of a linguistic negation, the choice is made by using the strength of the
user's negation and a preference principle which takes into account the
simplicity of the answer.


1  Introduction 

In this paper, we present a general model dealing with nuanced information expressed
in affirmative or negative forms as they may appear in knowledge bases
including rules like “If the patient is not vaccinated, the inflammation due to
the test is not moderate or high” and facts like “the inflammation due to the
test is not small”. In our work, we do not refer to classical logic negation but to
a new kind of negation called linguistic negation. The representation of affirmative
information can be made by using the model proposed in [1], which can deal
with nuanced information within a fuzzy context [11], and the representation of
negative information can be based on the methodology proposed in [6-8]. Previous
models [6], [8] have been improved in [7] in such a way that a user can
now deny a combination of nuanced properties, called here a complex nuance.
The first objective of this paper is to improve previous negation operators
in such a way that all negation forms $F_t$ of complex nuances U come from the
mechanism, denoted “All ... except ...”, which defines $Neg_t(x, U)$, the reference
frame of the linguistic negation of “x is U”, given $F_t$. The strength $\rho$ with which
the user denies “x is U” defines $Neg_t^\rho(x, U)$, a set containing the affirmative
translations V of the denied nuance U. In many cases, the user wishes only one
affirmative translation, so a choice strategy extracts one intended meaning
belonging to $Neg_t^\rho(x, U)$. In [6, 7], the choice is made among the solutions leading to
the greatest membership degree $\mu_V(x)$. When the linguistic negation is restricted
to one nuanced property, statistical data have been exploited to make this
uncertain choice [3]. The second objective of this paper is to enrich this choice
strategy in order to deal with denied complex nuances. This is done by taking
into account statistical data on the discourse universe, such as the frequency of
use of a fuzzy property associated with a concept, and of a nuance applied to a
property. We also suggest taking into account linguistic statistics about the use
of negation in natural language. To illustrate this model, we consider
an example of reasoning on medical facts and rules in a medical diagnosis field.
The statistical information is calculated by exploiting index cards written by
a doctor after his consultations. From the following three index cards ($IC_i$) and
five rules ($R_j$), we wish to deduce a diagnosis.

IC1: The temperature is not very low, the eardrum color is really very red, fat
eating is low, and the tension is perfectly (i.e. “exactly”) normal.

IC2: The temperature is not very high, the eardrum color is normal, fat eating
is high, and the tension is normal.

IC3: The temperature is rather little high, the eardrum color is normal, fat
eating is very low or really moderate, the tension is normal, and the inflammation
due to the monotest is not small.

R1: If the temperature is high and the eardrum color is very red, the disease is
an otitis.

R2: If the temperature is not normal, the patient is ill.

R3: If fat eating is not moderate, the cholesterol risk is not low.

R4: If the cholesterol risk is high, a diet with no fat is recommended.

R5: If the patient is not vaccinated, the inflammation due to the test is not
moderate or high.


2  Nuanced  Expression  Probability 

We suppose that the discourse universe is characterized by a finite number of
concepts $C_i$. A set of basic properties is associated with each $C_i$, whose
description domain is denoted as $D_i$. For example, the concept “temperature” can
be characterized by the basic properties “low”, “normal” and “high”; the concept
“eardrum color” by the basic properties “normal” and “red”; the concept
“cholesterol risk” by the properties “not-existent”, “low”, “moderate”, “high”
and “tremendous”; the concept “fat eating” by the properties “zero”, “low”,
“moderate” and “high”; and the concept “inflammation” by the properties
“not-existent”, “small”, “moderate” and “high”. Moreover, linguistic modifiers allow
us to express nuanced knowledge like “the temperature is rather normal”. In this
paper, we use the methodology proposed in [1] to cope with affirmative information
like “x is $f_\alpha m_\beta P_{ik}$” and with negative information like “x is not $f_\alpha m_\beta P_{ik}$”
(where $f_\alpha$ and $m_\beta$ are linguistic modifiers). In the following, $f_\alpha m_\beta$ is called the
nuance applied to $P_{ik}$. The modifiers $f_\alpha$ and $m_\beta$ are taken from two ordered
sets of fuzzy modifiers. The first one groups translation modifiers, which operate
both a translation and a precision variation on the basic property: in this paper,
we use $M_9 = \{m_\beta\}_{\beta \in [1..9]}$ = {extremely little, very very little, very little, rather
little, moderately, rather, very, very very, extremely} (Figure 1). The second set
contains precision modifiers, which make it possible to modify the precision of
the properties. Here, we use $F_6 = \{f_\alpha\}_{\alpha \in [1..6]}$ = {vaguely, neighboring, more or
less, moderately, really, exactly} (Figure 2). These two sets are totally ordered
by the relation: $m_\alpha < m_\beta$ (resp. $f_\alpha < f_\beta$) $\Leftrightarrow \alpha < \beta$. The modifier “moderately”
is equivalent to the empty word.


Fig. 1. Translation Modifiers


Fig. 2. Precision Modifiers


From the set of index cards owned by the doctor, we can extract statistical
knowledge about the use of nuanced expressions in his language. Note that these
statistics are computed only on affirmative expressions contained in these index
cards. After the analysis of the index cards, the percentages of use of every basic
property, when the doctor speaks about a given concept, are available.

Example: For the concept of temperature, the statistical probability that he uses
the term “low” is calculated by counting the occurrences of the term “low” (possibly
with a nuance) with regard to the total number of expressions concerning
temperature. For instance: 5% of positive expressions concerning “temperature”
use the term “low”, 80% the term “normal” and 15% the term “high”.

In the same way, we can obtain statistics about the use of every possible
nuance for every basic property. In practice, we only need statistics about the
use of relevant nuances: a combination of modifiers $f_\alpha m_\beta$ is relevant if it is used
in English. Out of 54 possible nuances, only 20 are relevant. The statistics
concerning the use of these 20 relevant combinations are computed in the same
way as for basic properties.

Example: For the property “low” associated with the concept of “temperature”,
the statistical probability that the doctor uses the nuance “really very” is computed
by counting the occurrences of “really very low” with regard to the total
number of expressions containing “low” and concerning temperature. For instance:
5% of positive expressions containing “low” use the nuance “really very”,
20% the empty nuance, 1% “very little”, 8% “very”, ...

Proposition 1. For any expression e = “x is $f_\alpha m_\beta P_{ik}$”, the statistical probability
that the doctor uses this expression in an index card when he speaks about the
concept $C_i$ is: if $Pr(bp(e) = P_{ik}) > 0$ then
$Pr(e) = Pr(bp(e) = P_{ik}) \times Pr(n(e) = f_\alpha m_\beta \mid bp(e) = P_{ik})$, else we set $Pr(e) = 0$,
where n(e) and bp(e) are respectively the nuance and the basic property of e.

Example: The statistical probability that “very low” appears in affirmative
expressions concerning a patient's “temperature” is equal to 0.05 × 0.08 = 0.004.
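Proposition 1 amounts to one multiplication guarded by a zero test. A minimal Python sketch follows; the dictionary layout and the function name are assumptions made for illustration, and the frequencies are those quoted in the two examples above (the empty nuance is encoded as the empty string).

```python
def expression_probability(property_freq, nuance_freq, prop, nuance):
    """Pr(e) = Pr(bp(e) = prop) * Pr(n(e) = nuance | bp(e) = prop), or 0."""
    p_prop = property_freq.get(prop, 0.0)
    if p_prop <= 0.0:
        return 0.0
    return p_prop * nuance_freq.get(prop, {}).get(nuance, 0.0)


# Frequencies quoted in the examples for the concept "temperature".
property_freq = {"low": 0.05, "normal": 0.80, "high": 0.15}
nuance_freq = {
    "low": {"really very": 0.05, "": 0.20, "very little": 0.01, "very": 0.08},
}

# Probability of "very low", approximately 0.004 up to floating-point rounding.
print(expression_probability(property_freq, nuance_freq, "low", "very"))
```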

Definition 1. Given $G = \{A_h\}_{h=1,\ldots,p}$, a subset of $N_i$ (the set of all nuanced
properties associated with the concept $C_i$), any finite combination of nuanced
properties of G based upon the operators $\wedge$ and $\vee$, denoted as $U(A_1, \ldots, A_h, \ldots, A_p)$,
is said to be a complex nuance induced from G. The set of all complex
nuances induced from G is denoted $G^*$.

If no confusion is possible, then U stands for $U(A_1, \ldots, A_h, \ldots, A_p)$.

More formally, if we suppose that any complex nuance is written in prefix
form, then $G^*$ can be recursively defined as follows:

1 - $\forall A_h \in G$, $A_h \in G^*$;
2 - $\forall U \in G^*$, $\forall V \in G^*$, $\gamma U V \in G^*$, with $\gamma \in \{\wedge, \vee\}$;
3 - Any element of $G^*$ results from a finite number of uses of rules 1 and 2.

Definition 2 (First preference principle). Let $U(A_1, \ldots, A_p)$ and $V(B_1, \ldots, B_q)$
be two complex nuances where all the $A_j$ (resp. $B_j$) are distinct: U is preferred to
V if and only if $\prod_{j=1..p} Pr(A_j) > \prod_{j=1..q} Pr(B_j)$.

Example: The expression “x is low” is preferred to “x is low or really very
low” since Pr(“x is low”) > Pr(“x is low”) × Pr(“x is really very low”), i.e.
0.1 > 0.1 × 0.0025 = 0.00025.
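The first preference principle compares two products of probabilities. A minimal Python sketch, assuming each complex nuance is summarised by the list of probabilities of its distinct nuanced properties (the function name is hypothetical):

```python
from math import prod


def is_preferred(probs_u, probs_v):
    """U is preferred to V iff prod Pr(A_j) > prod Pr(B_j)."""
    return prod(probs_u) > prod(probs_v)


# Values taken from the example above.
pr_low = 0.1
pr_really_very_low = 0.0025
print(is_preferred([pr_low], [pr_low, pr_really_very_low]))  # True
```

Since every factor is at most 1, adding a nuanced property to a complex nuance can never increase its product, so the principle favors the simpler answer.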

3  Linguistic  Negation  Translations 

Taking into account several linguistic results [5,4,2], Pacholczyk showed in [6]
that when the user says “x is not U”: (1) he rejects the sentence “x is U”, and
(2) he possibly refers either to another object y (i.e., “y is U”) or to another
property V (i.e., “x is V”) distinct from U but which concerns the same concept
$C_i$. The linguistic negation concept used here is defined, as in [9], by a
one-to-many mapping from E into $\mathcal{P}(E)$ (the parts of E), called here a multi-set function.
In this paper we propose a new definition of the concept of linguistic negation
which can be viewed as a generalization of the one proposed in [8]. Let M be the
set of all possible modifier combinations (including irrelevant combinations)
($M = \{f_\alpha\}_{\alpha \in [1..6]} \times \{m_\beta\}_{\beta \in [1..9]}$). Given a concept $C_i$, let $B_i = \{P_{ik}\}_{k \in \mathbb{N}}$ be
the set of basic properties associated with $C_i$ having $D_i$ as definition domain,
$N_{ik}$ be the set of all nuanced properties associated with property $P_{ik}$, and $N_i$ be
the set of all nuanced properties associated with the concept $C_i$. Given $G \subseteq N_i$,
$G^*$ denotes the set of all complex nuances induced from G. Then, for any concept
$C_i$, we define a linguistic negation operator as a parameterized function $Neg_t$:

Definition 3. For any concept $C_i$, let U and V be two complex nuances induced
from $N_i$, such that $U = U(n_1 P_{ik_1}, \ldots, n_j P_{ik_j}, \ldots, n_p P_{ik_p})$ and
$V = V(m_1 P_{il_1}, \ldots, m_j P_{il_j}, \ldots, m_q P_{il_q})$, and let $L = \{k_1, \ldots, k_j, \ldots, k_p, l_1, \ldots, l_j, \ldots, l_q\}$. A
linguistic negation operator is a function $Neg_t : D_i \times N_i^* \to \mathcal{P}(D_i) \times \mathcal{P}(N_i^*)$
defined as follows, knowing that $n \in M$, $\gamma \in \{\wedge, \vee\}$ and $t \in \{0, 1, 2, 3, 4, 5\}$:

- $Neg_0(x, u) = (\emptyset, \emptyset)$; $Neg_1(x, u) = (D_i \setminus \{x\}, \{u\})$; $Neg_2(x, u) = (\{x\}, N_i^* \setminus \{u\})$

- $Neg_3(x, nP_{ik}) = (\{x\}, N_{ik}^* \setminus \{nP_{ik}\})$
and $Neg_3(x, \gamma UV) = (\{x\}, (\bigcup_{h \in L} N_{ih})^* \setminus \{\gamma UV\})$

- $Neg_4(x, nP_{ik}) = (\{x\}, N_i^* \setminus N_{ik}^*)$; $Neg_4(x, \gamma UV) = (\{x\}, N_i^* \setminus (\bigcup_{h \in L} N_{ih})^*)$

- $Neg_5(x, nP_{ik}) = (\{x\}, N_{im}^*)$ where $P_{im} \neq P_{ik}$ is a precise property in $B_i$,
and $Neg_5(x, \gamma UV) = (\{x\}, (\bigcup_{m \in G} N_{im})^*)$ where G is a set of indices such that
$L \cap G = \emptyset$ and $\{P_{im} \in N_i \mid m \in G\} \subseteq B_i$.

It is possible to associate a standard form $F_t$ for “x is not U” with each
scope of the negation operator. When a speaker says “x is not U”, he means
that:

- $F_0$: For this x, “x is U” is rejected and there is no corresponding affirmative
expression. Saying “the disease is not an otitis” may not admit any
affirmative translation: the only thing that the doctor knows about the disease
is that it is not an otitis.

- $F_1$: Another object of the same domain satisfies the same nuanced property U.
“x is not U” means “not(x) is U”. The doctor can note that “Mary is not ill”
because he knows that “it is John and Jack who are ill”.

- $F_2$: The same x satisfies another complex nuance of $N_i^*$. For instance, the
doctor can say that “the temperature is not very high” because he hesitates
between “the temperature is normal” and “the temperature is little high”.

- $F_3$: The same x satisfies another complex nuance of $N_i^*$ induced from the same
basic properties. The doctor can say “the temperature is not low” because he
thinks that “the temperature is really low”.

- $F_4$: The same x satisfies a complex nuance of $N_i^*$ which is not induced from the
same basic properties. The doctor can say “cholesterol risk is not low” because
he thinks that “cholesterol risk is at least medium”.

- $F_5$: The same x satisfies a complex nuance of $N_i^*$ induced from other precise
basic properties. The doctor can say that “the patient is not very big” since he
thinks that “he is rather thin”.

- $F_6$: The same x satisfies a complex nuance induced from new affirmative basic
properties. In this case, “x is not U” means that the new property “not-U”
is associated with the same concept as U. “The patient is not seriously ill” may
introduce a new basic property “not-seriously-ill”.

Note  that,  in  this  paper,  we  are  not  concerned  with  the  forms  Fi  or  Fq  . 

Remark  1.  Let  t  €  {2,. .,5},  “x  is  not  U”  means  that  there  exists  a  complex 
nuance  V  such  that  we  have  “x  is  V” .  This  V  is  defined  as  follows:  V  G  F  such 
that  Negt(x,U)  =  ({z},  F)  with  F  G  In  other  words,  “x  is  V”  is  an 

affirmative  translation  of  “x  is  not  U”  in  form  Ft. 

If no confusion is possible, in the following, we simply write $V \in \mathrm{Neg}_t(x, U)$.

Proposition 2. $\mathrm{Neg}_5(x, u) \subset \mathrm{Neg}_4(x, u) \subset \mathrm{Neg}_2(x, u)$; $\mathrm{Neg}_3(x, u) \subset \mathrm{Neg}_2(x, u)$.


Assumption  :  Statistics  on  the  most  used  negation  forms  are  available  for  any 
nuanced  expression.  These  statistics  come  from  a  linguistic  analysis  of  English 
negation. 

For  instance,  “Temperature  is  not  normal”  usually  corresponds  to  a  nuance 
of  “low”  or  “normal”  or  “high”  and  refers  to  the  form  F2  in  70%  of  cases.  On 
the  other  hand,  “Temperature  is  not  very  low”  is  related  in  60%  of  cases  to  a 
nuance  of  “high”  (F5).  “Fat  eating  is  not  moderate”  usually  means  that  it  is 
at  least  “high”  (F5).  “Cholesterol  risk  is  not  low”  means  usually  that  it  can 
be  “moderate”,  “high”  or  “tremendous”  (F4).  “Temperature  is  not  very  high” 
usually  corresponds  to  another  nuance  of  “high”  (F3). 

4  Linguistic  Negation  Strength 

Let us notice that the subset $\mathrm{Neg}_t(x, U)$ systematically excludes U, but it must
also exclude complex nuances which are close to U. So, a complex nuance V
may be chosen as a negation of an expression u = “x is U” if V is a complex
nuance associated with the same concept as U and if, for a given x, whenever $\mu_U(x)$
(resp. $\mu_V(x)$) is close to 1, $\mu_V(x)$ (resp. $\mu_U(x)$) is close to 0.

Definition 4. Let $0.5 < \rho \le 1$. Let us suppose that $C_i$ is a concept, $F_t$ a standard
form ($t \in \{2, \dots, 5\}$), $\mathrm{Neg}_t$ a linguistic negation operator and $U(A_1, \dots, A_h, \dots, A_p)$ (or
U) is a complex nuance induced from $N_i$. The linguistic negation of U
applied to x, given $\rho$ and the standard form $F_t$, denoted as $\mathrm{Neg}_t^{\rho}(x, U)$,
is a set of complex nuances induced from $N_i$. More precisely, $V(B_1, \dots, B_m, \dots, B_q)$
(or V) being a complex nuance induced from $N_i$, then $V \in \mathrm{Neg}_t^{\rho}(x, U)$ iff:

L0: $\forall B_m, \exists A_h$ such that: L00: $B_m \in \mathrm{Neg}_t(x, A_h)$, L01: $\forall z,\ \mu_{A_h}(z) \ge \rho \Rightarrow \mu_{B_m}(z) \le 1 - \rho$,
L02: $\forall z,\ \mu_{B_m}(z) \ge \rho \Rightarrow \mu_{A_h}(z) \le 1 - \rho$.

L1: $\forall z,\ \{\mu_U(z) \ge \rho \Rightarrow \mu_V(z) \le 1 - \rho\}$, and L2: $\forall z,\ \{\mu_V(z) \ge \rho \Rightarrow \mu_U(z) \le 1 - \rho\}$.
Any $V \in \mathrm{Neg}_t^{\rho}(x, U)$ is said to be a linguistic negation of U applied to x,
given $\rho$ and the standard form $F_t$.

Definition 5. Let U and V be associated with the same concept.
$Vo_x(U, V) = \min\{\mu_U(x) \to_L \mu_V(x),\ \mu_V(x) \to_L \mu_U(x)\}$, where $\to_L$ is Łukasiewicz's
implication. Let the neighborhood between U and V be: $Vo(U, V) = \min_x Vo_x(U, V)$.

Proposition 3. If $V \in \mathrm{Neg}_t^{\rho}(x, U)$ then $Vo(V, U) \le 2(1 - \rho)$.

Definition 6. Knowing that $0.5 < \rho \le 1$, the value $2\rho - 1$ defines the rejection
strength associated with a property $V \in \mathrm{Neg}_t^{\rho}(x, U)$.


Proposition 4. $\mathrm{Neg}_t^{1}(x, U) \subset \mathrm{Neg}_t^{0.9}(x, U) \subset \dots \subset \mathrm{Neg}_t^{0.6}(x, U)$.

Example: Using Proposition 3, Vo(“normal”, “rather little high”) = 0.2, then
$\rho \le 0.9$, i.e., “rather little high” belongs at most to the subsets of $\mathrm{Neg}_t^{0.9}(x,$ “normal”$)$.
For Q = “extremely high”, Vo(“normal”, Q) = 0 means that $\rho \le 1$,
i.e. “extremely high” belongs to the subsets of $\mathrm{Neg}_t^{1}(x,$ “normal”$)$.
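
Definition 5 and the bound of Proposition 3 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the triangular membership functions, the temperature grid and all function names are assumptions introduced only for illustration.

```python
def lukasiewicz_implication(a, b):
    """Lukasiewicz implication: a -> b = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def neighborhood(mu_u, mu_v, domain):
    """Vo(U, V): minimum over the domain of min(mu_U -> mu_V, mu_V -> mu_U)."""
    return min(
        min(lukasiewicz_implication(mu_u(x), mu_v(x)),
            lukasiewicz_implication(mu_v(x), mu_u(x)))
        for x in domain
    )


def max_rho(vo):
    """Largest rho allowed by Proposition 3, i.e. Vo <= 2 * (1 - rho)."""
    return 1.0 - vo / 2.0


def triangle(a, b, c):
    """Hypothetical triangular membership function with peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


# Hypothetical fuzzy sets on a body-temperature scale (degrees Celsius).
normal = triangle(36.0, 37.0, 38.0)
high = triangle(38.0, 39.5, 41.0)
grid = [35.0 + 0.1 * i for i in range(71)]

vo = neighborhood(normal, high, grid)
print(vo, max_rho(vo))
```

Two properties with disjoint supports have neighborhood 0, so the bound places no restriction on the rejection strength; overlapping properties yield a positive neighborhood and hence a smaller admissible $\rho$.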

Definition 7 (Second Preference Principle). Given a standard form $F_t$ and
an expression “x is U”, let $\rho(V)$ be the greatest $\rho$ such that $V \in \mathrm{Neg}_t^{\rho}(x, U)$. V
is preferred to W for the negation of U if $\rho(V) > \rho(W)$.

Example: For the negation of “normal” in form F5, “extremely high” is preferred
to “rather little high”, since $\rho$(“extremely high”) = 1 whereas $\rho$(“rather
little high”) = 0.9.

The  following  properties  result  from  previous  definition  of  linguistic  negation. 

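Proposition 5. (1): Given $U \in N^*$ and $V \in N^*$: $V \in \mathrm{Neg}_t^{\rho}(x, U)$ if $U \in \mathrm{Neg}_t^{\rho}(x, V)$.
(2): A complex nuance $V \in \mathrm{Neg}_t^{\rho}(x, U)$ if: (2a): $U = A \wedge B$ with $A \in N^*$ and
$B \in N^*$, $V = P \vee Q$ with $P \in \mathrm{Neg}_t^{\rho}(x, A)$, $Q \in \mathrm{Neg}_t^{\rho}(x, B)$; (2b): $U = A \vee B$ with $A \in N^*$
and $B \in N^*$, $V = P \wedge Q$ with $P \in \mathrm{Neg}_t^{\rho}(x, A)$, $Q \in \mathrm{Neg}_t^{\rho}(x, B)$.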

Proposition 6 (Contraposition Laws). Let us suppose that the implication
$\to$ and its associated T-norm T satisfy the properties: $u \to v = 1$ iff $u \le v$; $T(u \to v,
v \to w) \le u \to w$ (weak transitivity law); $u \to v = \neg v \to \neg u$ (contraposition law). Then,
the extended linguistic negation possesses the following properties:

(i) If there exist $P \in \mathrm{Neg}_t^{\rho}(x, A)$ and $Q \in \mathrm{Neg}_t^{\rho}(y, B)$ such that $Q \subset \neg B$ and $\neg A \subset P$,
then the rule if “x is A” then “y is B” implies the rule if “y is not B” then “x
is not A”.

(ii) If there exist $P \in \mathrm{Neg}_t^{\rho}(x, A)$ and $Q \in \mathrm{Neg}_t^{\rho}(y, B)$ such that $P \subset \neg A$ and $\neg B \subset Q$,
then the rule if “y is not B” then “x is not A” implies the rule if “x is A” then
“y is B”.

(iii) If there exist $P \in \mathrm{Neg}_t^{\rho}(x, A)$ and $Q \in \mathrm{Neg}_t^{\rho}(y, B)$ such that $\neg A = P$ and $\neg B = Q$,
then the rules if “y is not B” then “x is not A”, and if “x is A” then “y is B”
are equivalent.

Example 1. Let us suppose that:

P = (rather low $\vee$ very low $\vee$ extremely low) $\in \mathrm{Neg}_2(x,$ (medium $\vee$ high)$)$,

(rather low $\vee$ very low $\vee$ extremely low) $\subset \neg$(medium $\vee$ high),

Q = $\neg$vaccinated. In the rule R5: “If the patient is not vaccinated, the inflammation
due to the test is not medium or not high”, the conclusion is translated
into “the inflammation is rather or very or extremely low”. By using the previous
proposition, this rule implies the rule “If the inflammation is medium or high
then the patient is vaccinated”.

5  Choice  Strategy 

We  extend  the  strategy  proposed  in  [3]  to  the  negation  of  complex  nuances: 

1. First, select the negation form $F_t$ according to linguistic statistics concerning
the complex nuance U (the assumption presented in Section 3).

2. Compute the set $\mathrm{Neg}_t(x, U)$.

3. Select in $\mathrm{Neg}_t(x, U)$ the complex nuances V such that $\rho(V)$ is maximum.

4.  Among  these  complex  nuances,  use  the  first  preference  principle  in  order  to 
make  a  choice. 

5. Discriminate between tied complex nuances by taking into account their complexity
degrees. This last point is justified by Sperber and Wilson’s simplicity
principle [10].
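
Steps 3 and 5 of the strategy can be sketched in a few lines. This is only an illustrative sketch: the candidate list, the $\rho$ values and the use of the number of words as a complexity degree are assumptions, and step 4 (the first preference principle) is omitted because it relies on statistics that are not reproduced here.

```python
def choose_negation(candidates):
    """Pick an affirmative translation among (label, rho, complexity) triples.

    Step 3: keep only the candidates with maximal rho.
    Step 5: among those, prefer the candidate with the lowest complexity.
    """
    best_rho = max(rho for _, rho, _ in candidates)
    strongest = [c for c in candidates if c[1] == best_rho]
    return min(strongest, key=lambda c: c[2])[0]


# Hypothetical candidates for the negation of "normal"; complexity is
# assumed here to be the number of words in the expression.
candidates = [
    ("rather little high", 0.9, 3),
    ("extremely high", 1.0, 2),
    ("high", 1.0, 1),
]
print(choose_negation(candidates))
```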

6  The  Model  in  Action 

First of all, we are going to translate all negative complex nuances appearing
in rules and facts. They concern the expressions: “temperature is not normal”,
“fat eating is not moderate”, “risk of cholesterol is not weak”, “temperature is
not very low”, “temperature is not very high”, “the inflammation is not small”,
“the patient is not vaccinated” and “the inflammation due to mono-test is not
moderate or high”. Let us decompose the affirmative translation of the assertion
“Temperature is not normal”: according to linguistic statistics on the different
forms of linguistic negation, this negation has the standard form F2. So, let
us calculate $\mathrm{Neg}_2(x,$ “normal”$)$. This set contains all nuances of all properties
associated with the concept “temperature” (except the precise nuance expression
“0 normal”): $\mathrm{Neg}_2(x,$ “normal”$) = N_{temperature} \setminus \{$“normal”$\}$. Then we
must select the properties having a maximal rejection strength. The following
properties have a strength equal to 1: properties containing “high” or “low”
without nuance, or with translation nuances greater than “rather”, or with precision
nuances greater than “more or less”, or with combinations “really very”,
“really very very”, “really extremely”, and also the three properties “extremely
little normal”, “very very little normal” and “very little normal”. Computing the
statistical probability of all these nuanced properties, it appears that the best
expression is “high”. “Temperature is not very low” is translated into “Temperature
is high”. “Temperature is not very high” by the form F2 leads to all nuances
of “high”, which finally gives “Temperature is little high”. “Inflammation is not
small” is translated into “Inflammation is medium or high”. “Fat eating is not
moderate”, by the form F3, leads to the nuances of “zero”, “low” and “high”,
and finally is translated into “Fat eating is high”. “The cholesterol risk is not
low” is translated into “The cholesterol risk is high”. The rule R5 is translated
as shown in Example 1. Finally, we obtain:

IC1:  The  temperature  is  high,  the  eardrum  color  is  really  very  red,  fat  eating  is 
low,  the  tension  is  exactly  normal. 

IC2: The temperature is little high, the eardrum color is normal, fat eating is
high, and the tension is normal.

IC3: The temperature is rather little high, the eardrum color is normal, fat eating
is very low or really moderate, the tension is normal, and the inflammation
due to the mono-test is medium or high.

R1: If the temperature is high and the eardrum color is very red, the disease is
an otitis.

R2: If the temperature is high, the patient is ill.

R3:  If  fat  eating  is  high,  the  cholesterol  risk  is  high. 

R4:  If  the  cholesterol  risk  is  high,  a  diet  with  no  fat  is  recommended. 

R5:  If  the  inflammation  due  to  the  monotest  is  medium  or  high  then  the  patient 
is  vaccinated. 

From  index  card  1,  rule  R1  can  be  fired,  since  “really  very  red”  implies  “very 
red” ,  so  we  can  infer  that  “The  disease  is  an  otitis” .  Rule  R2  leads  us  to  deduce 
that “The patient is ill”. The other rules cannot be fired. From index card
2,  rule  R3  gives  that  “the  cholesterol  risk  is  high”,  then  rule  R4  leads  us  to 
recommend  “a  diet  with  no  fat”.  By  using  rule  R5,  we  can  conclude  that  the 
third  patient  is  vaccinated. 
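
The inference just described can be replayed with a very small forward-chaining loop. This is a hedged sketch rather than the authors' system: facts and rule premises are matched as plain strings, and the table `ENTAILS`, which records that “really very red” implies “very red”, is a hypothetical stand-in for the fuzzy matching used in the model.

```python
# Rules R1-R5 after translation of the negative expressions.
RULES = [
    ("R1", {"the temperature is high", "the eardrum color is very red"},
     "the disease is an otitis"),
    ("R2", {"the temperature is high"}, "the patient is ill"),
    ("R3", {"fat eating is high"}, "the cholesterol risk is high"),
    ("R4", {"the cholesterol risk is high"}, "a diet with no fat is recommended"),
    ("R5", {"the inflammation due to the mono-test is medium or high"},
     "the patient is vaccinated"),
]

# Hypothetical entailment table standing in for nuance comparison.
ENTAILS = {
    "the eardrum color is really very red": ("the eardrum color is very red",),
}


def forward_chain(facts):
    """Return the conclusions derived from the facts, in firing order."""
    known = set(facts)
    for fact in facts:
        known.update(ENTAILS.get(fact, ()))
    derived = []
    changed = True
    while changed:
        changed = False
        for _name, premises, conclusion in RULES:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                derived.append(conclusion)
                changed = True
    return derived


IC1 = ["the temperature is high", "the eardrum color is really very red",
       "fat eating is low", "the tension is exactly normal"]
IC2 = ["the temperature is little high", "the eardrum color is normal",
       "fat eating is high", "the tension is normal"]
IC3 = ["the temperature is rather little high", "the eardrum color is normal",
       "fat eating is very low or really moderate", "the tension is normal",
       "the inflammation due to the mono-test is medium or high"]

for card in (IC1, IC2, IC3):
    print(forward_chain(card))
```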

7  Conclusion 

We have defined a general model of linguistic negation of assertions like “x is not
U” in the fuzzy context. This approach to negation, in accordance with linguistic
analysis, continues preceding works. The choice strategy uses the rejection
strength: one selects the complex nuances which belong to the strongest negations
of the initial complex nuance; then statistical information on the language
and on the customs of the speaker is used.

Abstract. Dynamic programming has been studied extensively, e.g.,
in  computational  geometry  and  string  matching.  It  has  recently  found  a 
new  application  in  the  optimal  multisplitting  of  numerical  attribute  value 
domains. We relate the results obtained earlier to this problem and study
whether they help to shed new light on the inherent complexity of this
time-critical  subtask  of  machine  learning  and  data  mining  programs. 
The  concept  of  monotonicity  has  come  up  in  earlier  research.  It  helps 
to explain the different asymptotic time requirements of optimal multisplitting
with respect to different attribute evaluation functions. As
case  studies  we  examine  Training  Set  Error  and  Average  Class  Entropy 
functions.  The  former  has  a  linear-time  optimization  algorithm,  while 
the  latter — like  most  well-known  attribute  evaluation  functions — takes  a 
quadratic  time  to  optimize.  It  is  shown  that  neither  of  them  fulfills  the 
strict  monotonicity  condition,  but  computing  optimal  Training  Set  Error 
values  can  be  decomposed  into  monotone  subproblems. 


1  Introduction 

Consider the optimal multisplitting problem faced by classifier learning algorithms
in processing numerical attributes. Given a sample S containing b indivisible
subsets and an evaluation function F to rank partition candidates, find the
F-optimal partition of S with at most k intervals. We denote a partition with k
intervals by $\biguplus_{i=1}^{k} R_i$. The examples are instances of m classes.

Numerical attribute domain partitioning is a time-critical subtask in machine
learning and data mining algorithms. Recently there have been many attempts
to enhance the efficiency of this task [2,3,5,6,7,8,9,14]. Many commonly-used
functions are cumulative: the score of a partition is a weighted sum
of its interval scores [9,5]. This property lets us apply dynamic programming
to combine the solution efficiently from optimal partitions of subsequences. The
time complexity of the algorithm is only quadratic in b.

The inherent complexity of the multisplitting task is uncharted territory.
However, mathematically similar problems have been encountered in computational
geometry and string matching [11]. This paper relates that work to the
multisplitting framework. It turns out that the optimal multisplitting algorithms
solve as a subproblem an instance of the column minima problem [11], for which
lower bound results are already known. This problem takes $\Omega(b^2)$ time if the
only knowledge that we have about the function is that it is cumulative. If the
function is monotone, it can be optimized in $O(b \log b)$ time and, further, a so-called
totally monotone function is optimizable in linear time. Commonly-used
attribute evaluation functions do not fulfill these properties. In particular, we
show that two functions, Average Class Entropy (ACE) [8,13] and Training Set
Error (TSE) [2,3,9], are not monotone.

It  is  known  that  many  evaluation  functions — including  ACE  and  TSE — fulfill 
Jensen’s  inequality  [4].  Its  consequence  is  convexity  of  the  function  over  the  data 
set  from  which  it  follows  that  each  partitioning  of  the  sample  leads  to  a  better 
partition  score.  We  study  whether  Jensen’s  inequality  alleviates  the  inherent 
complexity  of  optimal  multipartitioning. 

TSE is the only commonly-used evaluation function that is known to be optimizable
in linear time. Even though the function itself is not monotone, its
optimization algorithms can combine the result from optimal values for (totally)
monotone subproblems in a single scan through the data [2,3,9]. Similar decomposition
of more complex functions does not seem possible.

Section  2  describes  the  optimal  multisplitting  problem.  Then  we  examine,  in 
Section  3,  the  monotonicity  formulations  that  have  emerged  in  string  matching 
and computational geometry and translate Jensen’s inequality to the same vocabulary.
It corresponds to a weak form of monotonicity. In Sections 4 and 5
we  study  the  functions  ACE  and  TSE,  which  have  different  known  optimization 
requirements.  Section  6  discusses  the  observations  presented  in  this  paper  and 
gives  the  concluding  remarks  of  this  study. 


2  The  Optimal  Multisplitting  Problem 

Assume that the sample $S = \biguplus_{i=1}^{b} S_i$ with b indivisible subsets has been given.
In a partition $\biguplus_{i=1}^{k} R_i$ of S each interval is composed of consecutive sample
subsets, $R_i = \bigcup_{l=h}^{j} S_l$. The cumulative attribute evaluation function scoring
partition  candidates  is  defined  as 


where $w(R) = |R| f(R)$ is the score given to interval R by the “impurity” function
f.

The score of an interval consisting of subsets $S_h, \dots, S_j$ is denoted by
$w(h, j) = w(\bigcup_{i=h}^{j} S_i)$. Furthermore, $p(k, j)$ denotes the minimum value of F on
k-partitions of $S_1, \dots, S_j$.

The dynamic programming algorithm for optimal multisplitting [5,9,14] calculates
the recurrence

$$p(k, j) = \begin{cases} \min_{k-1 \le h < j} \{ p(k-1, h) + w(h+1, j) \} & \text{if } k \le j, \\ \infty & \text{otherwise,} \end{cases} \qquad (1)$$

and outputs $p(k, b)$ as the answer. In other words, the best partition of the subsets
into k intervals consists of some non-empty final interval
$R_k = \bigcup_{i=h+1}^{b} S_i$ together with the optimal $(k-1)$-partition of the sample prefix
composed of subsets $S_1, \dots, S_h$, where $h \ge k - 1$.

In recurrence 1 the computation of best k-partitions of a set depends on the
scores of the best $(k-1)$-partitions of the proper prefixes of the set. Thus, when
calculating values on row k, the algorithm consults the values on row $k - 1$.

The time complexity of the dynamic programming computation of recurrence 1
is $O(kb^2)$, excluding the work needed to compute $w(h, j)$ for each
$1 \le h \le j \le b$. There are $O(b^2)$ such terms; the computation of each value requires
scanning the class frequency distribution in $O(m)$ time. Thus, the total complexity
for obtaining the $w(h, j)$ values is $O(mb^2)$. We cannot avoid computing all
$w(h, j)$ values without knowing more about the evaluation function than just its
cumulativity.
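
Recurrence 1 can be sketched directly in code. The following is a minimal illustration under stated assumptions, not the implementation of [5,9,14]: the function names are hypothetical, the base case $p(1, j) = w(1, j)$ is assumed, and the interval score used in the usage lines is the number of minority-class examples (the interval score underlying TSE in Section 5).

```python
from math import inf


def interval_scores(segments, weight):
    """Table w[h][j] (1-based, h <= j) built from cumulative class counts."""
    b, m = len(segments), len(segments[0])
    prefix = [[0] * m]
    for seg in segments:
        prefix.append([p + c for p, c in zip(prefix[-1], seg)])
    w = [[inf] * (b + 1) for _ in range(b + 1)]
    for h in range(1, b + 1):
        for j in range(h, b + 1):
            counts = [prefix[j][c] - prefix[h - 1][c] for c in range(m)]
            w[h][j] = weight(counts)  # O(m) per interval
    return w


def optimal_multisplit(segments, k, weight):
    """Value p(k, b) of recurrence 1; O(k b^2) once all w(h, j) are known."""
    b = len(segments)
    w = interval_scores(segments, weight)
    p = [[inf] * (b + 1) for _ in range(k + 1)]
    for j in range(1, b + 1):
        p[1][j] = w[1][j]  # assumed base case: a single interval
    for arity in range(2, k + 1):
        for j in range(arity, b + 1):
            p[arity][j] = min(p[arity - 1][h] + w[h + 1][j]
                              for h in range(arity - 1, j))
    return p[k][b]


def minority_count(counts):
    """Interval score for TSE: examples outside the majority class."""
    return sum(counts) - max(counts)


# Class-count vectors of five indivisible subsets (two classes).
segments = [(2, 1), (0, 1), (6, 2), (1, 3), (4, 2)]
print(optimal_multisplit(segments, 3, minority_count))
```

Row `arity` of the table is filled only from row `arity - 1`, which is exactly the row-by-row dependence described above.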

3  The  Column  Minima  Problem  and  Optimal
Multisplitting

The  calculation  of  one  row  of  the  matrix  p  is  an  instance  of  the  one-dimensional 
dynamic  programming  problem  introduced  in  string  matching  context  by  Galil 
and  Park  [10]. 

Definition 1. Given a real-valued function $w(h, j)$ for integers $0 \le h < j \le b$
and $C[0]$, the one-dimensional dynamic programming problem is to compute

$$C[j] = \min_{0 \le h < j} \{ D[h] + w(h, j) \} \quad \text{for } 1 \le j \le b, \qquad (2)$$

where $D[h]$ is computed from $C[h]$ in constant time.

From  the  above  formula  we  obtain  recurrence  1  by  replacing  C[j]  with  p(k,  j), 
D[h]  with  p(k  —  1,  h),  and  w(h,j)  with  w(h  +  1  ,j). 

The one-dimensional dynamic programming problem, in turn, is equivalent
to finding the column minima of a $b \times b$ upper triangular matrix C, defined by

$$C[h, j] = p(k-1, h) + w(h+1, j). \qquad (3)$$

This problem is well-studied and lower bounds for its time complexity are known.
Most importantly, it has been shown that if no further information on the function
w is available, solving the column minima problem takes time $\Omega(b^2)$ [11].
This agrees with time complexity $O(b^2)$ of computing one row of the optimal
multisplitting matrix p [5]. Hence, any algorithm that needs the scores of $(k-1)$-splits
when computing the optimal k-split of a sample must take $\Omega(kb^2)$ time if
no extra information about the function being optimized is available.

Aggarwal  et  al.  [1]  studied  the  closely  related  problem  of  finding  the  row 
maxima  of  rectangular  matrices.  They  considered  two  kinds  of  monotonicity  in 
the  matrices.  We  translate  the  properties  to  the  column  minima  problem.  Finding 
the column minima of a matrix A is equivalent to finding the row maxima of its
negated transpose, $-A^T$.

Fig. 1. Illustration of the effect of monotonicity: the black squares denote the locations
of  column  minima.  Only  the  gray  area  of  the  upper  triangular  matrix  needs  to  be 
examined  to  discover  the  column  minima  of  the  matrix. 


Definition 2. Let $r(j)$ be the smallest row index for the minimum value in
column j of a matrix. The matrix is monotone if $1 \le j_1 < j_2 \le b$ implies that
$r(j_1) \le r(j_2)$. If every submatrix of A is monotone, then A is totally monotone.

An equivalent definition of total monotonicity was given by Galil and Park
[10]. Given a pair of row indices, $h_1 < h_2$, and a pair of column indices, $j_1 < j_2$,
in a totally monotone matrix C the following holds:

$$C[h_1, j_1] > C[h_2, j_1] \Rightarrow C[h_1, j_2] > C[h_2, j_2]. \qquad (4)$$

In the whole matrix both monotonicity and total monotonicity imply that the
minimum row indices are nondecreasing: $r(1) \le r(2) \le \dots \le r(k)$.

Aggarwal et al. [1] showed that the time complexity of the column minima
problem is reduced to $O(b \log b)$ when the matrix is monotone and introduced
an $O(b)$ algorithm for the totally monotone case. In their work the matrix is a
square, while the one-dimensional dynamic programming matrix is upper triangular.
This does not affect the asymptotic time complexity, because a $b \times b$ upper
triangular matrix contains a square submatrix of size $\lfloor b/2 \rfloor \times \lfloor b/2 \rfloor$. It is easy
to construct a matrix where the respective column minima fit inside the square
submatrix. Since the column minima problem needs to be solved also for this
submatrix, the same $O(b \log b)$ bound applies to upper triangular matrices.

In searching for the minimum element of column $j_2$, $j_1 < j_2$, in a monotone
matrix, we never have to look at rows with a smaller index than $r(j_1)$. Thus, the area
that needs to be covered in the matrix decreases as we go towards higher column
indices (see Fig. 1). The matrix manipulation never needs to back up to rows
with a smaller index than that of the most recently found column minimum.
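
The pruning effect of monotonicity can be illustrated with a standard divide-and-conquer search for the column minima of a monotone matrix. This is a sketch of the general idea, not the algorithm of Aggarwal et al. [1]; the function name and the example matrix are assumptions. The middle column is solved by a scan over the admissible rows, and its minimum row then bounds the rows to be examined for the columns on either side.

```python
def monotone_column_minima(value, n_rows, n_cols):
    """Row index of the (topmost) minimum of every column.

    Assumes the matrix given by value(h, j) is monotone, i.e. the row
    indices r(j) of the column minima are nondecreasing in j.
    """
    argmin = [0] * n_cols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        # min() returns the first minimal row, i.e. the smallest row index.
        best = min(range(row_lo, row_hi + 1), key=lambda h: value(h, mid))
        argmin[mid] = best
        solve(col_lo, mid - 1, row_lo, best)   # columns to the left: rows <= best
        solve(mid + 1, col_hi, best, row_hi)   # columns to the right: rows >= best

    solve(0, n_cols - 1, 0, n_rows - 1)
    return argmin


# Hypothetical monotone matrix: the minimum of column j lies near row j / 2.
def value(h, j):
    return (j - 2 * h) ** 2


print(monotone_column_minima(value, 8, 8))
```

Each level of the recursion examines a number of entries that is linear in the matrix dimension, and there are logarithmically many levels, which matches the $O(b \log b)$ behavior quoted for the monotone case.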

These strong monotonicity conditions require the optimized function to be
simple. Unfortunately, no attribute evaluation function is known to satisfy them.
Hence, they do not seem to help in bringing the time complexity of optimal
multisplitting down. Subsequently we show explicitly that two commonly-used
attribute evaluation functions are not monotone. On the other hand, we are able
to explain TSE's linear-time complexity $O(kmb)$ with the help of monotonicity.

The most useful general property that is known to hold for many evaluation
functions is Jensen’s inequality [4]:

$$w(h, i) + w(i+1, j) \le w(h, j), \quad \text{for any } h \le i < j. \qquad (5)$$

Intuitively,  this  inequality  states  that  we  obtain  a  better  score  for  a  partition  of 
the  sample  by  dividing  the  final  interval  into  two  rather  than  keeping  it  together 
as  one.  Taking  this  further,  any  splitting  of  the  data  will  make  the  partition 
score  better.1 

Inequality 5 is equivalent to the following weak monotonicity condition:

$$\forall i < j:\ C[h, i] > C[i, i] \Rightarrow C[h, j] > C[i, j]. \qquad (6)$$

Weak monotonicity has the following interpretation. If a $(k-2)$-split of a prefix
of the data has a better score than a $(k-1)$-split of the same prefix, the optimal
multisplit cannot contain the $(k-1)$-split as its part. Elomaa and Rousu [6] use
property 6 to dynamically prune the search space. In conjunction with the ACE
function, the search space could be reduced by half, which indicates that the
expected-case behavior of the search algorithm differs considerably from the
worst case outlined above.

This  property  is  not  as  helpful  as  the  stronger  monotonicity  conditions.  In 
the  worst  case,  the  items  C[i,i]  are  all  column  maxima  and  the  location  of  the 
column minima cannot be constrained at all. Thus, the asymptotic behavior of
the multisplitting algorithms cannot be improved below $\Omega(b^2)$ with the help of
Jensen’s inequality. It relates partitions of different arity, while the monotonicity
conditions  only  talk  about  partitions  with  the  same  arity.  In  the  following,  we 
show  by  two  counterexamples  that  Jensen’s  inequality  and  monotonicity  are  not 
equivalent. 

4  Average  Class  Entropy 

The Average Class Entropy of a partition $\biguplus_i R_i$ of a sample S is

$$ACE\Big(\biguplus_i R_i\Big) = \frac{1}{|S|} \sum_i |R_i| H(R_i),$$

where H is the entropy function,

$$H(R) = -\sum_{j=1}^{m} P(C_j, R) \log P(C_j, R),$$

in  which  m  denotes  the  number  of  classes  and  P(C,  R)  stands  for  the  proportion 
of  the  examples  in  R  that  belong  to  the  class  C.  ACE  is  a  concave  function  by 
the concavity of the entropy function [4]. Thus, it fulfills Jensen’s inequality.

1  This  is  the  reason  why  we  need  to  bound  the  partition  arities  in  practical  situations. 
ACE is a component function in many well-known evaluation functions used,
e.g.,  in  decision  tree  learning;  Information  Gain  and  Gain  Ratio  [13]  as  well 
as  Normalized  Distance  measure  [12]  are  examples  of  evaluation  functions  that 
build  upon  ACE. 
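
The interval score behind ACE and its relation to inequality 5 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the base-2 logarithm is an assumption (the text does not fix the base), and the class counts are arbitrary examples.

```python
from math import log2


def entropy(counts):
    """Class entropy H(R) of an interval given its class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def ace_weight(counts):
    """Interval score w(R) = |R| * H(R) used in the cumulative sum."""
    return sum(counts) * entropy(counts)


def merge(left, right):
    """Class counts of the union of two adjacent intervals."""
    return [a + b for a, b in zip(left, right)]


left, right = [3, 0], [17, 5]
split_score = ace_weight(left) + ace_weight(right)
merged_score = ace_weight(merge(left, right))
# Inequality 5: splitting an interval never increases the summed score.
print(split_score <= merged_score)
```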

4.1  Average  Class  Entropy  Is  Not  Monotone 

We  show  by  a  simple  example  that  ACE  does  not  satisfy  monotonicity. 

Let the sample $S = \biguplus_{i=1}^{5} S_i$ consist of five indivisible subsets with the following
class distributions for the two classes:

$S_1 = (0, 3)$, $S_2 = (3, 0)$, $S_3 = (17, 5)$, $S_4 = (4, 5)$, $S_5 = (4, 2)$.

It is not possible to split the first indivisible subset into two partition intervals.
Hence, $C[1, 4] = C[1, 5] = \infty$. The optimal ACE scores of the other
3-partitions are:

$C[2, 4] = ACE(S_1 \uplus S_2 \uplus (S_3 \cup S_4)) \approx 26.328$
$C[3, 4] = ACE(S_1 \uplus (S_2 \cup S_3) \uplus S_4) \approx 25.031$

$C[2, 5] = ACE(S_1 \uplus S_2 \uplus (S_3 \cup S_4 \cup S_5)) \approx 31.847$
$C[3, 5] = ACE(S_1 \uplus (S_2 \cup S_3) \uplus (S_4 \cup S_5)) \approx 31.963$
$C[4, 5] = ACE(S_1 \uplus (S_2 \cup S_3 \cup S_4) \uplus S_5) \approx 35.225$

The values $C[3, 4]$ and $C[2, 5]$ are the minima for the columns 4 and 5, respectively.
Their row indices, $r(4) = 3$ and $r(5) = 2$, violate the monotonicity
condition, which requires non-decreasing row indices for the column minima.

5  Training  Set  Error 

The majority class of a set R, denoted by $maj_C(R)$, is its most frequently occurring
class. The number of disagreeing instances is given by

$$\delta(R) = |\{ r \in R \mid val_C(r) \ne maj_C(R) \}|.$$

Training Set Error is the number of training instances falsely classified in the
partition. For a partition $\biguplus_i R_i$ of S it is defined as

$$TSE\Big(\biguplus_i R_i\Big) = \sum_i \delta(R_i).$$

Also TSE is concave and, thus, fulfills Jensen’s inequality.

The linear-time optimization of TSE is best understood from the relatively
simple algorithm for it. Auer’s [2] $O(kmb)$ time algorithm for optimal TSE
Table 1. Auer’s [2] algorithm for optimal TSE partitions.


Partition optimalTSE(k, S, b, m)

/* Partition sample S into at most k intervals so that the training set
   error is minimized. There are b possible cut points and m classes. */

// DATA STRUCTURES: P[h,j] is the optimal partition of the processed
// data into h intervals with the last one labeled by j. E[h,j] is the
// corresponding number of disagreements. E' stores intermediate results.
Partition[k][m] P; int[k][m] E; int[m] E';

for ( h = 1 to k ) // initialize
  for ( j = 1 to m ) { P[h,j] = (S_1, j); E[h,j] = 0; }
for ( i = 1 to b ) // go through segments
{ for ( j = 1 to m ) { E'[j] = E[1,j]; E[1,j] += δ_j(S_i); }
  for ( h = 2 to k )
  { j* = arg min_j E'[j]; E* = E'[j*];
    for ( j = 1 to m )
    { E'[j] = E[h,j];
      if ( E* < E[h,j] ) { P[h,j] = P[h-1,j*] ⊎ (S_i, j); E[h,j] = E* + δ_j(S_i); }
      else { P[h,j] = P[h,j]; E[h,j] += δ_j(S_i); } } } }
j* = arg min_j E[k,j];
return P[k,j*];


multipartitioning (Table 1) assumes that the sample has been sorted into ascending
order by the value of the numerical attribute under consideration. It
maintains for each class j the optimal partitions of the prefixes of the sorted
data into $1, \dots, k$ intervals such that the last interval is labeled by j. In the
algorithm $\delta_j(S)$ denotes the number of disagreements with class j in the subset
S; $\delta_j(S) = |\{ s \in S \mid val_C(s) \ne j \}|$.

In processing each new indivisible example segment a simple update of partitions
suffices. We only need to check whether the new segment labeled by class j
obtains fewer total disagreements when it is combined together with the previously
known best partition of the data into k intervals with the last one labeled by
j or together with the previously known best partition of the data into $k - 1$
intervals with the last one labeled by a class other than j.

This  is  a  consequence  of  the  monotonicity  condition.  When  scanning  through 
the  data  the  disagreement  with  respect  to  each  class  grows  monotonically.  Auer’s 
algorithm  can  be  seen  to  compute  m  (totally)  monotone  optimization  problems 
in  parallel. 
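
The single-scan idea can be sketched as follows. This is not a transcription of Table 1: it is a simplified variant that tracks only the error counts (not the partitions themselves), and the function name and table layout are assumptions made for illustration. For every arity h and class j it keeps the fewest disagreements over the processed prefix when the last interval is labeled j.

```python
from math import inf


def min_training_set_error(segments, k):
    """Minimum TSE over partitions into at most k intervals, in O(k m b)."""
    m = len(segments[0])
    # err[h][j]: fewest disagreements when the processed prefix is cut into
    # exactly h intervals and the last interval is labeled with class j.
    err = [[inf] * m for _ in range(k + 1)]
    err[0] = [0] * m  # empty prefix: no interval has been opened yet
    for seg in segments:
        size = sum(seg)
        best_prev = [min(row) for row in err]  # best h-interval prefix scores
        new_err = [[inf] * m for _ in range(k + 1)]
        for h in range(1, k + 1):
            for j in range(m):
                disagreements = size - seg[j]  # delta_j of the new segment
                # Either extend the last interval labeled j, or start a new
                # interval labeled j after the best (h-1)-interval prefix.
                new_err[h][j] = min(err[h][j], best_prev[h - 1]) + disagreements
        err = new_err
    return min(min(row) for row in err[1:])


# Class-count vectors of five indivisible subsets (two classes).
segments = [(2, 1), (0, 1), (6, 2), (1, 3), (4, 2)]
print(min_training_set_error(segments, 3))
```

Each segment triggers a constant amount of work per (arity, class) pair, so the scan never revisits earlier segments.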

5.1  Training  Set  Error  Is  Not  Monotone 

We  show  that  TSE  does  not  satisfy  monotonicity,  even  though  optimizing  it  can 
be  decomposed  into  m  monotone  subproblems. 
Again, let the sample $S = \biguplus_{i=1}^{5} S_i$ consist of five indivisible subsets with the
following class distributions:

$S_1 = (2, 1)$, $S_2 = (0, 1)$, $S_3 = (6, 2)$, $S_4 = (1, 3)$, $S_5 = (4, 2)$.

Like above, $C[1, 4] = C[1, 5] = \infty$. The optimal TSE scores of the other
3-partitions are:

$C[2, 4] = TSE(S_1 \uplus S_2 \uplus (S_3 \cup S_4)) = TSE(\langle 2, 1 \rangle, \langle 0, 1 \rangle, \langle 7, 5 \rangle) = 6$
$C[3, 4] = TSE(S_1 \uplus (S_2 \cup S_3) \uplus S_4) = TSE(\langle 2, 1 \rangle, \langle 6, 3 \rangle, \langle 1, 3 \rangle) = 5$

$C[2, 5] = TSE(S_1 \uplus S_2 \uplus (S_3 \cup S_4 \cup S_5)) = TSE(\langle 2, 1 \rangle, \langle 0, 1 \rangle, \langle 11, 7 \rangle) = 8$
$C[3, 5] = TSE(S_1 \uplus (S_2 \cup S_3) \uplus (S_4 \cup S_5)) = TSE(\langle 2, 1 \rangle, \langle 6, 3 \rangle, \langle 5, 5 \rangle) = 9$
$C[4, 5] = TSE(S_1 \uplus (S_2 \cup S_3 \cup S_4) \uplus S_5) = TSE(\langle 2, 1 \rangle, \langle 7, 6 \rangle, \langle 4, 2 \rangle) = 9$

The row indices, $r(4) = 3$ and $r(5) = 2$, of the column minima are decreasing,
contrary to the monotonicity condition.
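
The scores of the five partitions listed above can be recomputed mechanically. The snippet below is a small sketch with hypothetical helper names; it only evaluates TSE on the listed partitions and does not run the dynamic programming itself.

```python
def tse(partition):
    """Training Set Error: examples outside the majority class, per interval."""
    return sum(sum(counts) - max(counts) for counts in partition)


def union(*subsets):
    """Class counts of the union of several indivisible subsets."""
    return tuple(sum(column) for column in zip(*subsets))


S = {1: (2, 1), 2: (0, 1), 3: (6, 2), 4: (1, 3), 5: (4, 2)}

scores = {
    (2, 4): tse([S[1], S[2], union(S[3], S[4])]),
    (3, 4): tse([S[1], union(S[2], S[3]), S[4]]),
    (2, 5): tse([S[1], S[2], union(S[3], S[4], S[5])]),
    (3, 5): tse([S[1], union(S[2], S[3]), union(S[4], S[5])]),
    (4, 5): tse([S[1], union(S[2], S[3], S[4]), S[5]]),
}
for cell, score in sorted(scores.items()):
    print(cell, score)
```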


5.2  Remarks 

TSE differs from other functions that satisfy Jensen's inequality by its optimization
efficiency that is below the inherent complexity of dynamic programming
algorithms based on recurrence 1. Nevertheless, TSE itself is not a monotone function
and the optimization algorithm may have to back up to recover the optimal
partition. This can also be seen from the algorithm: an extra, one-dimensional
table E’ is needed to store intermediate results in the dynamic programming.
The number of back-up points, though, is constant; there is only a single location
per class to go back to.

A single left-to-right scan over the ordered data suffices to reveal the optimal
TSE value, because for a fixed class and arity finding the optimal TSE value
is a totally monotone problem. Thus, the lower bound for the restricted case is
$\Omega(b)$. Combining this for all k arities and m classes gives a total time requirement
of $\Omega(kmb)$, which is the asymptotic time requirement of Auer’s algorithm.

6  Discussion 

The results in this article suggest that it may be hard to improve the asymptotic
time complexity of the optimal multisplitting algorithm. If the algorithm relies
on the best (k − 1)-partitions being computed for all prefixes when computing
the k-partitions, an $\Omega(b^2)$ bound seems inevitable for each row, which implies a total
complexity bound of $\Omega(kb^2)$ for the whole problem. Hence, if more efficient
algorithms are desired, the computation should be arranged in some other way than
row by row, or the rows should be made sparse by utilizing special properties of
the evaluation function.

The strongest general property that is known to hold for many evaluation
functions is Jensen's inequality, which guarantees that a partition cannot
have a worse score than its subpartitions. However, this alone does not reduce
the asymptotic time complexity. Stronger properties than Jensen's inequality
(monotonicity or total monotonicity) lead to better time complexities. However,
as shown by the counterexamples, neither of these properties holds for the
commonly used ACE and TSE functions.
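
The row-by-row computation discussed above can be sketched as follows. This is a minimal illustration, not the algorithm analyzed in this article: it assumes a recurrence of the usual prefix form (the best k-partition of a prefix is the best (k − 1)-partition of a shorter prefix plus one final interval), uses TSE as the evaluation function, and the names `optimal_multisplit`, `tse`, and the sample blocks are hypothetical.

```python
def tse(counts):
    """Training set error of one interval: instances outside the majority class."""
    return sum(counts) - max(counts)


def optimal_multisplit(blocks, k, impurity=tse):
    """Minimum total impurity of cutting `blocks` into exactly k intervals.

    blocks: class distributions of the b indivisible subsets, in order.
    Each of the k rows takes O(b^2) interval evaluations.
    """
    b = len(blocks)
    m = len(blocks[0])
    # prefix[j][c]: number of class-c instances in blocks[0:j]
    prefix = [[0] * m]
    for block in blocks:
        prefix.append([p + x for p, x in zip(prefix[-1], block)])

    def interval(i, j):
        """Impurity of the single interval made of blocks[i:j]."""
        return impurity([hi - lo for lo, hi in zip(prefix[i], prefix[j])])

    inf = float("inf")
    # row[j]: best cost of cutting blocks[0:j] into the current number of intervals
    row = [inf] + [interval(0, j) for j in range(1, b + 1)]
    for parts in range(2, k + 1):
        new_row = [inf] * (b + 1)
        for j in range(parts, b + 1):
            # Try every position i for the start of the last interval.
            new_row[j] = min(row[i] + interval(i, j) for i in range(parts - 1, j))
        row = new_row
    return row[b]


# Hypothetical two-class sample with four indivisible subsets.
sample = [(3, 0), (0, 2), (1, 1), (4, 0)]
scores = [optimal_multisplit(sample, k) for k in (1, 2, 3, 4)]
```

The inner `min` is where a monotone or totally monotone evaluation function would allow most candidate positions `i` to be skipped; for TSE and ACE the sketch has to examine all of them.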

The concept of a monotone function is potentially relevant for the attribute
evaluation functions used in machine learning algorithms. Designing monotone
or even totally monotone functions would allow efficient optimization of these
functions. Recent advances in solving the optimal multisplitting problem seem
to bring the solution closer to the problem's inherent complexity bound.

Acknowledgements. We thank Esko Ukkonen for bringing to our attention

Abstract. Itemset share, the fraction of some numerical total contributed by
items when they occur in itemsets, has been proposed as a measure of the
importance of itemsets in association rule mining. The IAB and CAC algorithms
[4] are able to find share frequent itemsets that have infrequent subsets. These
algorithms perform well but do not always find all possible share frequent
itemsets. In this paper, we describe the incorporation of a threshold factor into these
algorithms. The threshold factor can be used to increase the number of frequent
itemsets found at a cost of an increase in the number of infrequent itemsets
examined. The modified algorithms are tested on a large commercial database.
Their behavior is examined using principles of classifier evaluation from
machine learning.


1  Introduction 

A data mining problem receiving considerable attention is the discovery of association
rules from market basket data. The problem was first introduced in the context of bar
code data analysis [1]. The goal of bar code data analysis is to identify buying
patterns by examining itemsets, groups of items purchased together in transactions. From
any itemset, an association rule can be derived which, given the purchase of a subset
of the items in an itemset, predicts the probability of the purchase of the remaining
items. The problem of discovering association rules from transaction data can be
decomposed into two subtasks [1]: (1) find all itemsets meeting a minimum frequency
requirement, and (2) generate association rules from the frequent itemsets. The
second step is relatively easy compared to the first [14]. The focus of this paper is the
first task, the extraction of frequent itemsets from transaction data. While the problem
and our methods are general, we present the problem in terms of the retail sales
domain, because it provides easily accessible intuitions for explaining problems,
concepts and solutions.

In general, examination of all possible combinations of products and services
offered by a retail organization is impractical, so methods are needed to focus effort on
itemsets considered important to an organization. Itemset share, the fraction of some
numerical value, such as total quantity of items sold or total profit, contributed by
items when they occur in an itemset, has been proposed as a measure of itemset
importance [5]. Unlike support [1], itemset share can be applied to non-binary numerical
data that are typically associated with items in a transaction, allowing for a more
insightful analysis of the impact of itemsets in terms of stock, cost or profit. In practice,
itemset rankings by support and share can be significantly different [5].

The support measure is downward closed, since all subsets of a frequent itemset are
themselves frequent [3]. This property has permitted the development of efficient
algorithms that traverse only a portion of the itemset lattice, yet find all possible
frequent itemsets, e.g. [3, 9, 14]. However, since share can work with non-binary
numerical values, the share of an itemset can be higher than the share of its subsets.
Thus, if the frequency requirement is based on the total share of the itemset, frequent
itemsets might contain infrequent subsets. Algorithms that do not rely on the property
of downward closure have been proposed to extract this class of frequent itemset from
transaction data [4]. The CAC and IAB algorithms perform well when applied to a
large commercial database, finding most of the frequent itemsets while counting few
infrequent itemsets. In this paper, we introduce parametric versions of these
algorithms. A parameter called the threshold factor is used to increase (decrease) the
ability of the algorithms to find frequent itemsets at the cost of increasing (decreasing) the
number of infrequent itemsets examined. Algorithm behavior is evaluated using
methods developed for classification system evaluation in machine learning.

2  The  Share  Measure  and  Share  Frequent  Itemsets 

We summarize itemset methodology formally as follows [2]. Let $I = \{I_1, I_2, \ldots, I_m\}$ be
a set of literals, called items. Let $D = \{T_1, T_2, \ldots, T_n\}$ be a set of n transactions, where
for each transaction $T \in D$, $T \subseteq I$. A set of items $X \subseteq I$ is called an itemset. A
transaction T contains an itemset X if $X \subseteq T$. Each itemset X is associated with a set of
transactions $T_X = \{T \in D \mid T \supseteq X\}$, the set of transactions containing itemset X.

A measure attribute (MA) is a numerical attribute associated with each item in
each transaction, such as quantity sold [4]. The transaction measure value of item $I_p$
in transaction $T_q$, $tmv(I_p, T_q)$, is the value of a measure attribute associated with $I_p$ in $T_q$.
The global measure value of item $I_p$, $MV(I_p)$, is the sum of the transaction measure
values of $I_p$ in every transaction in which $I_p$ appears, given as

$$MV(I_p) = \sum_{T_q \in T_{I_p}} tmv(I_p, T_q). \tag{1}$$

The total measure value (MV) is the sum of the global measure values for all items in I
in every transaction in D, given as

$$MV = \sum_{p=1}^{m} MV(I_p). \tag{2}$$

We use $x_i$ to denote the $i$th item of an itemset X. The item local measure value of item
$x_i$ in itemset X, $lmv(x_i, X)$, is the sum of the transaction measure values of the item $x_i$ in
all transactions containing X, given by

$$lmv(x_i, X) = \sum_{T_q \in T_X} tmv(x_i, T_q). \tag{3}$$

The itemset local measure value of itemset X, $lmv(X)$, is the sum of the local measure
values of each of the k items in X in all transactions containing X, given by

$$lmv(X) = \sum_{i=1}^{k} lmv(x_i, X). \tag{4}$$

The item share of an item $x_i$ in itemset X, $SH(x_i, X)$, is the ratio of the local measure
value of $x_i$ in X to the total measure value, as given by

$$SH(x_i, X) = lmv(x_i, X) / MV. \tag{5}$$

The itemset share of itemset X, $SH(X)$, is the ratio of the local measure value of X to
the total measure value, as calculated by

$$SH(X) = lmv(X) / MV. \tag{6}$$

We illustrate share using the sample transaction database in Table 1. The TID
column gives the transaction identifier values. Beneath each item name are values
indicating quantity of item sold, the measure attribute. The table values are transaction
measure values. For example, tmv(D, T1) is 14. Global measure values for the items
are indicated in the last row. The total measure value MV is 100. Table 2 provides
local measure values, item shares and itemset shares for the itemset ACD and its
subsets. For itemset AD, lmv(AD) = lmv(A, AD) + lmv(D, AD) = 6 + 25 = 31 and SH(AD)
= lmv(AD)/MV = 31/100 = 0.31. Support (sup) is shown for comparison.

Table 1: Sample Transaction Database

TID       Item A   Item B   Item C   Item D     MV
T1           1        0        1       14
T2           0        0        6        0
T3           1        0        2        4
T4           0        0        4        0
T5           0        0        3        1
T6           0        0        1       13
T7           0        0        8        0
T8           4        0        0        7
T9           0        1        1       10
T10          0        0        0       18
MV(I_p)      6        1       26       67      100

Table 2: Sample Database Summary

Itemset           Item A         Item C         Item D            X
   X      sup    lmv    SH      lmv    SH      lmv    SH      lmv    SH
   A      0.3      6   0.06       -     -        -     -        6   0.06
   C      0.8      -     -       26   0.26       -     -       26   0.26
   D      0.7      -     -        -     -       67   0.67      67   0.67
   AC     0.2      2   0.02       3   0.03       -     -        5   0.05
   AD     0.3      6   0.06       -     -       25   0.25      31   0.31
   CD     0.5      -     -        8   0.08      42   0.42      50   0.50
   ACD    0.2      2   0.02       3   0.03      18   0.18      23   0.23

To find frequent itemsets with infrequent subsets, we employ the following
definition of share frequency. An itemset X is share frequent, or simply frequent, if
$SH(X) \geq minshare$, a user-defined minimum share value. This definition of frequency is not
downward closed. A property P is downward closed with respect to the lattice of all
itemsets if, for each itemset with the property P, all of its subsets also have the
property P [12]. However, the share of an itemset may increase or decrease as the itemset
is extended by adding an item. Adding an item x to a k-itemset X to create a new
(k+1)-itemset Y adds a restriction to the measure values of the items in X. The
measure values associated with the items in X contribute to the local measure value of Y
only when they occur with the new item x. Their contribution towards the local
measure value of Y must be less than or equal to their contribution to the local measure
value of X. However, the local measure value of x is added to the local measure value
of Y, which may therefore be less than, equal to, or greater than the local measure value of X. If
share frequency is measured against the share of the itemset, an itemset with share
above minshare may have component itemsets with share below minshare.
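
The definitions in Equations (1) to (6) can be checked against the sample database with a short script. This is a sketch under two assumptions that are not stated in the text: a transaction is taken to contain an item when its quantity is greater than zero, and the item C quantities for T8 and T10 are taken to be 0, which is what the column total MV(C) = 26 implies. The names `lmv`, `share`, and `support` mirror the notation above; `ROWS` and `DB` are illustrative.

```python
from itertools import combinations

ITEMS = "ABCD"
# Quantities sold per transaction, as in Table 1 (columns A, B, C, D).
ROWS = [
    (1, 0, 1, 14),  # T1
    (0, 0, 6, 0),   # T2
    (1, 0, 2, 4),   # T3
    (0, 0, 4, 0),   # T4
    (0, 0, 3, 1),   # T5
    (0, 0, 1, 13),  # T6
    (0, 0, 8, 0),   # T7
    (4, 0, 0, 7),   # T8  (C assumed 0)
    (0, 1, 1, 10),  # T9
    (0, 0, 0, 18),  # T10 (C assumed 0)
]
DB = [dict(zip(ITEMS, row)) for row in ROWS]
MV = sum(sum(t.values()) for t in DB)  # total measure value, Equation (2)


def contains(transaction, itemset):
    """Assumption: an item is present when its quantity is positive."""
    return all(transaction[x] > 0 for x in itemset)


def lmv(itemset):
    """Itemset local measure value, Equation (4)."""
    return sum(sum(t[x] for x in itemset) for t in DB if contains(t, itemset))


def share(itemset):
    """Itemset share, Equation (6)."""
    return lmv(itemset) / MV


def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(contains(t, itemset) for t in DB) / len(DB)


MINSHARE = 0.20
frequent = {
    "".join(c)
    for size in range(1, len(ITEMS) + 1)
    for c in combinations(ITEMS, size)
    if share(c) >= MINSHARE
}
```

With minshare = 0.20, `frequent` contains ACD (share 0.23) although its subsets A and AC have shares 0.06 and 0.05, which is the failure of downward closure described above.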

3  Description  of  Algorithms 

The Combine All Counted (CAC) and Item Add-Back (IAB) algorithms extract
share-frequent itemsets from transaction data, including those with infrequent subsets [4].
Parametric versions of the algorithms, which we refer to as parametric CAC (PCAC)
and parametric IAB (PIAB), are introduced here. The major modification is the
addition of a threshold factor.

The first pass through the data collects information about all 1-itemsets in the data.
Summary information is compiled, including MV and $TC_T$, the total number of
transactions. $C_k$ is the set of candidate itemsets for the $k$th pass. $C_2$ is generated using
information about the 1-itemsets, and information about the candidate 2-itemsets is
collected in pass 2. The process of building $C_k$ using itemsets in $C_{k-1}$ continues until no
candidate itemsets are added to $C_k$. After the $k$th pass, the local measure value and
transaction count are available for each counted k-itemset.

In the IAB algorithm, each item with a non-zero transaction count is given the
chance to contribute to the generation of candidate itemsets for every pass. In the $k$th
pass, candidate itemsets are generated by adding to each itemset in $C_{k-1}$ any item
found in the first pass that is not contained in the itemset. In the absence of pruning,
this would produce an exhaustive algorithm. To prevent this, three types of pruning
are used. First, zero pruning removes any itemset $X_i \in C_{k-1}$ for which $TC_{X_i} = 0$.
Second, share infrequency pruning removes any itemset $X_i \in C_{k-1}$ whose actual share
$SH(X_i) < minshare$. Third, predictive pruning uses a heuristic to calculate the
predicted share of a potential candidate itemset $X_{pc}$, $PSH(X_{pc})$, and prunes any $X_{pc}$ where
$PSH(X_{pc}) < minshare$. $PSH(X_{pc})$ is based on the actual share of components of $X_{pc}$ [4].

We modify predictive pruning to create the PIAB algorithm. The intuition is that
itemsets having nearly enough share should not be pruned, since their supersets may
have enough. We define the threshold factor (TF) as a parameter, ranging from 0.0 to
1.0, that is applied to the share threshold prior to comparison with a predicted share
value. Potential candidate itemsets are added to $C_k$ only if $PSH(X_{pc}) \geq TF \cdot minshare$.
For TF = 1.0, the parametric versions of the algorithms behave identically to the
non-parametric versions. The threshold factor is similar to the relaxation factor, a
parameter that has been applied to the support measure as a means of adjusting algorithm
accuracy [10].

The processes of candidate itemset generation and itemset pruning are encapsulated
in a procedure GenerateCandidateItemsets. The generation of the next potential
candidate itemset $X_{pc}$ is represented by an iterator procedure GenerateNextItemset.
The first call to the procedure returns the first generated itemset, repeated calls cycle
through all possible generated itemsets and, when no more itemsets can be generated,
the procedure returns false. The value of $PSH(X_{pc})$ is returned by the function
GetPredictedShare. For PIAB, the procedure GenerateCandidateItemsets is written
as:

GenerateCandidateItemsets(C_{k-1})
    foreach X_i ∈ C_{k-1}
        if TC_{X_i} = 0 or SH(X_i) < minshare then
            remove X_i from C_{k-1}
    while X_pc := GenerateNextItemset() do
        PSH(X_pc) := 0, SubsetCount := 0
        foreach x_i ∈ X_pc
            foreach x_j ∈ X_pc where i ≠ j
                add x_j to X_s
            if X_s ∉ C_{k-1} then
                continue
            PSH(X_pc) := PSH(X_pc) + GetPredictedShare(x_i, X_s)
            SubsetCount := SubsetCount + 1
        if PSH(X_pc) / SubsetCount ≥ TF*minshare then
            add X_pc to C_k
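
The procedure above can be transcribed into Python roughly as follows. This is a sketch, not the authors' implementation: itemsets are represented as frozensets, `stats` is a hypothetical mapping from each counted itemset to its transaction count and actual share, and `predicted_share` is a caller-supplied stand-in for GetPredictedShare, whose heuristic is defined in [4] and is not reproduced here.

```python
def piab_candidates(prev_level, items, stats, predicted_share, minshare, tf):
    """One call of GenerateCandidateItemsets for PIAB (illustrative sketch).

    prev_level: set of frozensets, the counted (k-1)-itemsets C_{k-1}
    items: items found in the first pass
    stats: maps each itemset in prev_level to (transaction_count, share)
    predicted_share(x, subset): stand-in for GetPredictedShare
    """
    # Zero pruning and share infrequency pruning of C_{k-1}.
    kept = {
        s for s in prev_level
        if stats[s][0] > 0 and stats[s][1] >= minshare
    }
    candidates = set()
    for base in kept:
        for item in items:
            if item in base:
                continue
            cand = base | {item}  # add back any first-pass item
            total, subset_count = 0.0, 0
            for x in cand:
                subset = cand - {x}
                if subset not in kept:
                    continue  # skip subsets that were not kept
                total += predicted_share(x, subset)
                subset_count += 1
            # `base` itself is always a kept subset, so subset_count >= 1.
            if total / subset_count >= tf * minshare:
                candidates.add(cand)
    return candidates


# Hypothetical first-pass results and a constant stand-in estimator.
first_pass = {frozenset("A"): (3, 0.06), frozenset("C"): (8, 0.26),
              frozenset("D"): (7, 0.67)}
strict = piab_candidates(set(first_pass), "ACD", first_pass,
                         lambda x, subset: 0.15, minshare=0.20, tf=1.0)
relaxed = piab_candidates(set(first_pass), "ACD", first_pass,
                          lambda x, subset: 0.15, minshare=0.20, tf=0.6)
```

With the constant estimate of 0.15, no candidate passes at TF = 1.0, while at TF = 0.6 the relaxed threshold of 0.12 admits AC, AD and CD. Note that A is removed from the kept set by share infrequency pruning but is still added back as a single item.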

In the CAC algorithm, each counted itemset is given a chance to contribute to the
generation of a larger frequent itemset. Itemsets are generated by combining itemsets
in $C_{k-1}$ which differ only in their last item. Again, in the absence of pruning, the
algorithm is exhaustive. Zero pruning and predictive pruning are done as with IAB.
Infrequency pruning is done, but after the generation of candidate itemsets. In addition,
subset pruning prunes any generated itemset with a (k−1)-subset not found in $C_{k-1}$. To
create the PCAC algorithm, the threshold factor is incorporated. For PCAC, the
procedure GenerateCandidateItemsets is written:

GenerateCandidateItemsets(C_{k-1})
    foreach X_i ∈ C_{k-1}
        if TC_{X_i} = 0 then
            remove X_i from C_{k-1}
    while X_pc := GenerateNextItemset() do
        PSH(X_pc) := 0, SubsetCount := 0
        foreach x_i ∈ X_pc
            foreach x_j ∈ X_pc where i ≠ j
                add x_j to X_s
            if X_s ∉ C_{k-1} then
                break
            PSH(X_pc) := PSH(X_pc) + GetPredictedShare(x_i, X_s)
            SubsetCount := SubsetCount + 1
        if SubsetCount = k and PSH(X_pc) / SubsetCount ≥ TF*minshare then
            add X_pc to C_k
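
A corresponding Python sketch of the PCAC procedure is given below, again with assumptions: itemsets are sorted tuples so that "differ only in their last item" becomes a prefix comparison, `counts` is a hypothetical mapping from counted itemsets to transaction counts, and `predicted_share` stands in for GetPredictedShare from [4]. Infrequency pruning is not shown because it takes place after candidate generation.

```python
def pcac_candidates(prev_level, counts, predicted_share, minshare, tf):
    """One call of GenerateCandidateItemsets for PCAC (illustrative sketch).

    prev_level: set of sorted tuples, the counted (k-1)-itemsets C_{k-1}
    counts: maps each itemset in prev_level to its transaction count
    predicted_share(x, subset): stand-in for GetPredictedShare
    """
    kept = {s for s in prev_level if counts[s] > 0}  # zero pruning only
    candidates = set()
    for a in kept:
        for b in kept:
            # Combine itemsets that agree on everything but the last item.
            if a[:-1] != b[:-1] or a[-1] >= b[-1]:
                continue
            cand = a + (b[-1],)
            total, subset_count = 0.0, 0
            for i, x in enumerate(cand):
                subset = cand[:i] + cand[i + 1:]
                if subset not in kept:
                    break  # subset pruning: one missing subset rejects cand
                total += predicted_share(x, subset)
                subset_count += 1
            if (subset_count == len(cand)
                    and total / subset_count >= tf * minshare):
                candidates.add(cand)
    return candidates


def estimate(x, subset):
    """Hypothetical constant estimator, used only to exercise the sketch."""
    return 0.26


# Pass-2 itemsets counted when AC was pruned, and when AC was also counted.
without_ac = {("A", "D"): 3, ("C", "D"): 5}
with_ac = {("A", "C"): 2, ("A", "D"): 3, ("C", "D"): 5}
level3_without = pcac_candidates(set(without_ac), without_ac, estimate, 0.20, 1.0)
level3_with = pcac_candidates(set(with_ac), with_ac, estimate, 0.20, 0.6)
```

When only AD and CD were counted, no pair shares a prefix and no 3-itemset is generated; once AC is also counted, AC and AD combine into ACD and all three of its 2-subsets are present.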


We illustrate the effect of the threshold factor by comparing the behavior of the
CAC and PCAC algorithms. Figure 1 gives the itemset lattice for the sample data set
in Table 1. Each node in the lattice is labeled with the itemset name. Below the
itemset name are the total measure value of the itemset in all transactions and the number
of transactions in which the itemset appears, separated by a forward slash. The total
measure value MV is equal to 100 and, assuming minshare is 0.20, any itemset with a
measure value greater than or equal to 20 is share frequent. Frequent itemsets are
shaded in Figure 1. A threshold factor of 0.60 is assumed.


Figure 1: Itemset Lattice

CAC: The first pass identifies 1-itemsets C and D as frequent itemsets. All
counted 1-itemsets are used to generate the candidate itemsets for pass 2. However,
before any 2-itemset is added to $C_2$, we prune based on predicted share value [4].
Consider itemset AC. $SH(A) = 0.06$, $TC_A = 3$ and $SH(C) = 0.26$, $TC_C = 8$. Since $TC_A < TC_C$,
$PSH(AC) = SH(A) + SH(C) \cdot (TC_A / TC_T) = 0.06 + 0.26 \cdot (3/10) = 0.14$, which is less
than minshare, so AC is pruned. Only AD and CD meet the minimum share
requirement, so they are the only 2-itemsets counted in pass 2. The algorithm terminates after
pass 2 because AD and CD cannot be used to generate a 3-itemset with all subsets


568  B.  Barber  and  H.J.  Hamilton 


existing in $C_{k-1}$. The frequent 3-itemset ACD is missed because the subset AC was not
counted.

PCAC: PCAC performs in the same way as CAC until predictive pruning, where
PSH(AC) = 0.14 is now compared to TF*minshare = 0.12. Since the less stringent
requirement is met, AC is also counted in pass 2. After the second pass we find that
AC is infrequent, but before discarding it, we use it to generate candidate itemsets. It
is combined with itemset AD to generate ACD. Since we counted AC, AD and CD in
pass 2, ACD is not subset pruned. PSH(ACD) = 0.26, which is greater than
TF*minshare, so ACD is counted in the third pass. The algorithm terminates after the
third pass since no 4-itemsets can be created from a single 3-itemset. We counted one
additional itemset and it was frequent. By setting a less stringent criterion for deciding
which itemsets to count, PCAC may find frequent itemsets that CAC missed.
However, the increased effectiveness comes at the cost of counting more infrequent
itemsets.
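
The pruning decision for AC in the two walkthroughs can be reproduced numerically. The function below is only an inference from the single worked computation of PSH(AC) above (the item with the smaller transaction count keeps its full share and the other item's share is scaled by that count over the total number of transactions); the general heuristic is defined in [4], and the name `predicted_share_pair` is hypothetical.

```python
def predicted_share_pair(share_x, tc_x, share_y, tc_y, tc_total):
    """Predicted share of a 2-itemset, inferred from the PSH(AC) example."""
    if tc_x > tc_y:
        # Make x the item with the smaller transaction count.
        share_x, tc_x, share_y, tc_y = share_y, tc_y, share_x, tc_x
    return share_x + share_y * (tc_x / tc_total)


MINSHARE = 0.20
TF = 0.60

# SH(A) = 0.06, TC_A = 3, SH(C) = 0.26, TC_C = 8, and 10 transactions in total.
psh_ac = predicted_share_pair(0.06, 3, 0.26, 8, 10)

counted_by_cac = psh_ac >= MINSHARE        # compared with 0.20
counted_by_pcac = psh_ac >= TF * MINSHARE  # compared with 0.12
```

The unrounded value is 0.138, reported as 0.14 in the text. It falls between the two thresholds, so AC is pruned by CAC but counted by PCAC.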

4  Evaluation  Criteria 

An algorithm for extracting frequent itemsets from a data set can be thought of as a method for
classifying itemsets into two classes, frequent itemsets (positive instances) and infrequent
itemsets (negative instances). A confusion matrix [7] gives information about actual and predicted
classifications done by a classification system. Performance of such systems is commonly
evaluated using terms defined on the matrix data. These terms are listed in Table 3. Here, a is
the number of correct predictions that an itemset is infrequent, b is the number of incorrect
predictions that an itemset will be frequent, c is the number of incorrect predictions that an
itemset will be infrequent, and d is the number of correct predictions that an itemset will be
frequent.

Table 3: Classifier Evaluation Terms

Term                       Proportion of                                        Formula
Accuracy (AC)              total number of predictions that were correct        (a+d)/(a+b+c+d)
True Positive Rate (TP)    positive cases correctly identified                  d/(c+d)
False Positive Rate (FP)   negative cases incorrectly classified as positive    b/(a+b)
True Negative Rate (TN)    negative cases classified correctly                  a/(a+b)
False Negative Rate (FN)   positive cases incorrectly classified as negative    c/(c+d)
Precision (P)              predicted positive cases that were correct           d/(b+d)

Most  itemsets  in  an  itemset  lattice  are  infrequent  or  do  not  occur  at  all.  Accuracy 
may  not  be  an  adequate  performance  measure  when  the  number  of  negative  cases  is 
much  greater  than  the  number  of  positive  cases  [8]. 
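
The terms of Table 3 can be computed directly from the four counts. The sketch below uses made-up counts, chosen only to show how accuracy can stay high on an imbalanced lattice while few of the frequent itemsets are found; the class name `Confusion` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    a: int  # infrequent itemsets predicted infrequent
    b: int  # infrequent itemsets predicted frequent
    c: int  # frequent itemsets predicted infrequent
    d: int  # frequent itemsets predicted frequent

    def accuracy(self):
        return (self.a + self.d) / (self.a + self.b + self.c + self.d)

    def true_positive_rate(self):
        return self.d / (self.c + self.d)

    def false_positive_rate(self):
        return self.b / (self.a + self.b)

    def true_negative_rate(self):
        return self.a / (self.a + self.b)

    def false_negative_rate(self):
        return self.c / (self.c + self.d)

    def precision(self):
        return self.d / (self.b + self.d)


# Hypothetical counts: 990 infrequent itemsets and only 10 frequent ones.
m = Confusion(a=980, b=10, c=8, d=2)
summary = {
    "AC": m.accuracy(),
    "TP": m.true_positive_rate(),
    "FP": m.false_positive_rate(),
    "P": m.precision(),
}
```

In this made-up case accuracy is 0.982 although only 2 of the 10 frequent itemsets are found (TP = 0.2), which is why the true and false positive rates are examined separately.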

We use ROC graphs to examine the general behavior of the PIAB and PCAC
algorithms with variation of the threshold factor. An ROC graph is a plot with the true
positive rate (TP) on the Y axis and the false positive rate (FP) on the X axis [13].
Point (0,1) is the perfect classifier, classifying all cases correctly. Point (0,0)
represents a classifier that predicts all cases to be negative, while the point (1,1)
corresponds to a classifier that predicts every case to be positive. Point (1,0) is the
classifier that is incorrect for all classifications. The CAC and IAB algorithms each produce a
single ROC point, an (FP,TP) pair for a particular data set. In the PCAC and PIAB
algorithms, each TF value produces an (FP,TP) pair for the data set. A series of such
pairs can be used to plot an ROC curve.

It has been suggested that the area beneath an ROC curve can be used as a measure
of accuracy in many applications [13]. In [11], it is argued that using classification
accuracy to compare classifiers is not adequate unless cost and class distributions are
completely unknown and a single classifier must be chosen to handle any situation.
The authors propose evaluation of classifiers using an ROC graph and imprecise cost and class
distribution information.
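
As an illustration of the area measure mentioned above, the area beneath a series of (FP, TP) pairs can be approximated with the trapezoidal rule. This is a sketch under stated assumptions: the curve is anchored at (0,0) and (1,1), points are joined by straight lines, and the sample points are hypothetical rather than taken from any experiment; `roc_area` is an illustrative name.

```python
def roc_area(points):
    """Trapezoidal area under (FP, TP) points, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# Hypothetical (FP, TP) pairs, one per threshold factor value.
hypothetical_points = [(0.1, 0.6), (0.3, 0.9)]
area = roc_area(hypothetical_points)

diagonal = roc_area([])            # no information beyond the anchors
perfect = roc_area([(0.0, 1.0)])   # the perfect classifier at point (0,1)
```

The diagonal gives an area of 0.5 and the perfect classifier an area of 1.0, so values in between indicate how far a curve bends toward point (0,1).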

Itemset algorithms must have sufficient generality that they can be applied to any
transaction database, and so we choose not to include assumptions regarding cost and
class distributions in our analysis. ROC graphs provide a visual tool for examining the
tradeoff between the ability of a classifier to correctly identify positive cases and the
number of negative cases incorrectly classified. We use ROC graphs to examine
algorithm behavior, without making claims about the relative accuracy of the parametric
algorithms.

5  Experimental  Results 

All experiments were carried out using a 450 MHz Pentium II PC with 384 MB of
RAM. Test data were drawn from an eight million record customer database
provided to us by a commercial partner. Transaction information is contained in a three
million record table representing the purchase of 2200 unique items by over 500,000
customers. Five discovery tasks were performed on different subsets of the data.
Baseline information was obtained using the ZSP algorithm [4], which is guaranteed
to find all frequent itemsets but requires a large amount of space and time. By
knowing the true answers, we are able to evaluate the behavior and performance of our
parametric algorithms. A share threshold of 0.02 was used, since this produced tasks
small enough to run the ZSP algorithm on but not too small to be interesting.
Information about the tasks is contained in Table 4. Tasks T1, T2, T3 and T4 were run for
threshold factors 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 and 0.0. T5 was tested
for threshold factors 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.1 and 0.0.


Table  4:  Task  Metrics 


Task Identifier: T1, T2, T3, T4, T5
Transaction Count: 599, 5092, 14257
Record Count: 1844, 4373, [illegible], 24725, 43700

Figure 2 shows the ROC graphs for PCAC and PIAB for four of the five tasks.
Only the upper portions of the graphs are shown because this includes all ROC points.
The ROC graph for T3 is not shown because both algorithms found all frequent
itemsets for all values of TF. Decreasing the threshold factor simply moves successive
ROC points further to the right on the top axis of the ROC graph.

At TF equal to 1.0, the PCAC and PIAB algorithms are equivalent to CAC and
IAB. As TF is decreased towards 0.0, the share frequency criterion becomes less
stringent and more of the itemsets that are generated may be selected for counting.
The general trend is for the ROC points to move upward and to the right as the
threshold factor is decreased. However, the true positive rate tends to move up in a series
of steps. This is best illustrated in Figure 2(a), where the initial ROC points for the
PIAB algorithm in T1 move to the right but not upward. The true positive rate for the
PIAB algorithm remains the same until the threshold factor equals 0.4. Similar
behavior is displayed for the PCAC algorithm, although it seems more prevalent in the
PIAB algorithm. The step behavior of the true positive rate occurs because a decrease
in threshold factor will have no effect until certain discrete events occur. Additional
frequent itemsets must be generated, predicted to be frequent, not subset pruned (for
the PCAC algorithm), and counted. The false positive rate always increased with a
decrease in the threshold factor, although in some cases the increase was small.

Figure 2: Task ROC. Panels: 2(a): T1, 2(b): T2, 2(c): T4, 2(d): T5

In the PCAC algorithm, all counted itemsets are used to generate potential
candidate itemsets. As the threshold factor is decreased, the number of counted itemsets
increases and thus more itemsets are available for itemset generation and fewer
itemsets are subset pruned. For TF = 0, the PCAC algorithm accepts any itemset that is generated,
provided that none of its subsets have been zero pruned, so the PCAC algorithm
becomes equivalent to the ZSP algorithm, with $TP_{PCAC} = TP_{ZSP} = 1$ and $FP_{PCAC} = FP_{ZSP}$. In
other words, if the threshold factor is decreased far enough, the algorithm is
guaranteed to find all frequent itemsets. On the other hand, the PIAB algorithm does not
guarantee this. The upper limit of TP equals 1 and this limit was reached in T2 and
T5. However, in T1 and T4, $TP_{PIAB}$ is less than 1 for a threshold factor of 0.0.
Itemsets are generated by adding single items to frequent itemsets from the previous pass.
If the set of frequent itemsets for the $k$th pass does not change with a decrease in the
threshold factor, then the foundation of the generated itemsets remains the same.
Once all itemsets that can be generated by adding single items to frequent itemsets are
counted, the true and false positive rates are at their maximum. Thus, there is a
tradeoff between algorithms. The PCAC algorithm will find all frequent itemsets, provided
the threshold factor is low enough, but it may count many infrequent itemsets. The
PIAB algorithm counts fewer infrequent itemsets at lower threshold factors but may
not find some frequent itemsets, regardless of the threshold factor.

The degree of variation between the ROC curves for different tasks run against the
same transaction database indicates that we may not be able to provide a single ROC
curve that is applicable to all tasks or transaction databases. Further tests against
different transaction databases will be required to confirm or refute this. However, the
general shape of the ROC curve seems consistent enough to provide the user with
insight into the general behavior that can be expected as the threshold factor is varied.

6  Conclusions 

We described the incorporation of a threshold factor into algorithms for discovering
share frequent itemsets, including those that contain infrequent subsets. The threshold
factor can be used to increase (decrease) the effectiveness of the algorithms at a cost
of an increase (decrease) in the number of infrequent itemsets examined. The CAC
and IAB algorithms are useful heuristic algorithms for finding share frequent itemsets. This study
indicates that a threshold factor can be successfully incorporated into these algorithms
to increase the number of frequent itemsets found.

Abstract. Since the early 1980s, the rapid growth of hospital information
systems has led to large amounts of laboratory examination data being stored
in databases. Thus, it is highly expected that knowledge discovery and data
mining (KDD) methods will find interesting patterns in these databases, as a
reuse of stored data, and will be important for medical research and practice,
because human beings cannot deal with such a huge amount of data.
However, there are still few empirical approaches which discuss the whole
data mining process from the viewpoint of medical data. In this paper,
the KDD process for a hospital information system is presented using
two medical datasets. This empirical study shows that preprocessing and
data projection are the most time-consuming processes, which very
few data mining studies have discussed so far, and that application
of rule induction methods is much easier than preprocessing.


1  Introduction 

Medical practice and research have been changed by the rapid growth of the life
sciences, including biochemistry and immunology (Levinson, 1996). The mechanism
of a disease can now be explained as a biochemical process or a cell disorder, and
the diagnostic accuracy of medical experts is increasing thanks to the development
of laboratory examinations. However, data analysis remains indispensable for
generating hypotheses. For instance, the discoveries of HIV infection and hepatitis
type C were inspired by the analysis of clinical courses that were unexpected by
experts in immunology and hepatology, respectively (Fauci et al., 1997). Although
the life sciences have advanced rapidly, the mechanisms of many diseases are still
unknown; neurological diseases in particular are very difficult to analyze because
their prevalence is very low (Adams and Victor, 1993). Even the mechanisms of
diseases with high prevalence, such as cancer, are only partially known to medical
experts. In this sense, medical research always needs good hypotheses, which is
one of the most important motivations for medical people to use data mining and
knowledge discovery.

Another aspect also attracts medical researchers to data mining. Since the early
1980s, rapidly growing hospital information systems (HIS) have stored large
amounts of laboratory examination data in databases (Van Bemmel and Musen,
1997). For example, in a university hospital, where more than 1000 patients visit


Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI  1932,  pp.  573-581,  2000. 
©  Springer- Verlag  Berlin  Heidelberg  2000 


574  S.  Tsumoto 


from Monday to Friday, a database system stores more than 1 GB of numerical data
from laboratory examinations each year. Furthermore, the storage of medical images
and other types of data is discussed in medical informatics as a research topic
on electronic patient records, and all medical data will be stored in hospital
information systems within the 21st century. Thus, it is widely expected that
data mining methods will find interesting patterns in these databases by reusing
the stored data, and that they will be important for medical research and practice,
because human beings cannot deal with such a huge amount of data.

In this paper, the knowledge discovery and data mining (KDD) process (Fayyad
et al., 1996) is presented for two medical datasets extracted from hospital
information systems. This empirical study shows that preprocessing and data
projection are the most time-consuming processes, which very few data mining
studies have discussed so far, and that applying rule induction methods
is much easier than preprocessing.

2  Data  Selection 

In this paper, we use the following two datasets for data mining, which are
extracted from two different hospital information systems. One is bacterial test
data, which consists of 101,343 records with 254 attributes. This dataset includes
past and present history, physical and laboratory examinations, diagnosis, therapy,
the type of infectious disease, detected bacteria, and sensitivities to antibiotics.
The other is a dataset on the side effects of steroids, which consists of 31,119
records with 287 basic attributes. This dataset includes past and present history,
physical and laboratory examinations, diagnosis, therapy, and the type of side effects.
The characteristic of the second dataset is that it is a temporal database:
of its 287 basic attributes, 213 have more than 100 temporal records each.
These datasets are obtained through the first three steps of the KDD process:
data selection, data cleaning and data reduction.

In the first step of the KDD process, these databases are extracted from two
different hospital information systems by simple queries. Table 1 gives the results
of data selection: the second column shows the total size of each HIS when the data
were selected (December, 1998). The third column presents the size of the data
selected from the HIS. Finally, the fourth column gives the computational time
required for data selection. Since the two HISs are implemented on different
computers, it may be difficult to compare the computational times directly, but
the values suggest that the time depends not on the size of the selected data
but on the total HIS size.


Table 1. Data Selection Results

Datasets        HIS size            Target Data      Time Required
Bacterial Test  1,275,242 (52GB)    361,932 (14GB)   2.3 Days
Side-Effect     2,631,549 (100GB)   135,749 (6GB)    7.3 Days

3 Data Cleaning as Preprocessing

After data selection, data cleaning is required, since the data obtained in the
first step are not clean and include records that are not suitable for data analysis.
Even though these records match the query condition, they do not contain a
sufficient amount of information. In the case of bacterial tests, some patients
have information about bacterial tests but underwent very few laboratory
examinations. In the case of side effects, some patients have a record of steroid
therapy but underwent only two or three examinations each year, since the status
of their allergic diseases was stable and they did not suffer from any side-effects.
Such data might be removed if the queries in the first step were refined. However,
it should be pointed out that refining the queries is not an easy task, because
there are many types of unsuitable data, which are difficult to predict and arise
not only from factors related to patients but also from factors related to medical
doctors. For example, some patients may not come to the outpatient clinic once
they recover from allergic disorders. Some doctors may tell patients not to come
to the clinic so often. In some cases, patients are found to suffer from a very
severe disease and have to be admitted to a university hospital. In other cases,
patients may move from a university hospital to a municipal hospital. Many factors
not recorded in a database, mainly social factors, thus degrade the cleanness of
the data.

Thus, it is much easier to clean the data with simple statistics than with
complex queries. Since each domain has attributes that are indispensable for
evaluating the status of patients, the number of attributes used to describe each
record is a good index for removing unclean records. We therefore define a two-step
cleaning procedure. First, we select the records that have no missing values in the
predefined indispensable attributes. Then, for each selected record, we count how
many of the remaining attributes are actually used to describe it. If the number
of attributes used for a case is not sufficient, the case is removed.
The indispensable attributes and the threshold for the second selection are given
a priori by domain experts. In the case of bacterial tests, 254 attributes are
regarded as important, of which 27 are indispensable for describing each case.
Thus, in the data cleaning step, we first select the records that have no missing
values in the 27 attributes. Then, we count how many of the remaining 227
attributes have missing values. If 75% of them are missing, the record is removed.
In the case of steroid side-effects, the same strategy is applied: 74 non-temporal
attributes are indispensable and 213 temporal attributes are used for the second
selection.

Table 2. Data Cleaning Results

Datasets        Selected Data     Cleaned Data      Time Required
Bacterial Test  361,932 (14GB)    101,343 (3GB)     5.2 Days
Side-Effect     135,749 (6GB)     31,119 (1.5GB)    2.5 Days

Table 2 summarizes the results of data cleaning. The second column shows the
number of records selected in the data selection step. The third column shows the
size of the data after the data cleaning process is applied. The fourth column gives
the computational time required for both cleaning steps.
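
The two-step cleaning described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the record layout (a dictionary with `None` for a missing value), the function name `clean_records`, and the choice to remove a record when the missing ratio is greater than or equal to the threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed layout: one dict per record, None marks a missing value.
Record = Dict[str, Optional[float]]


def clean_records(
    records: List[Record],
    indispensable: List[str],
    max_missing_ratio: float = 0.75,
) -> List[Record]:
    """Two-step cleaning: drop records with gaps in indispensable attributes,
    then drop records whose remaining attributes are mostly missing."""
    cleaned = []
    for rec in records:
        # Step 1: every indispensable attribute must have a value.
        if any(rec.get(attr) is None for attr in indispensable):
            continue
        # Step 2: measure how sparsely the remaining attributes are filled.
        others = [a for a in rec if a not in indispensable]
        if others:
            missing = sum(1 for a in others if rec[a] is None)
            if missing / len(others) >= max_missing_ratio:
                continue
        cleaned.append(rec)
    return cleaned
```

In this sketch the list of indispensable attributes and the threshold play the role of the expert-supplied parameters mentioned in the text (27 attributes and 75% for the bacterial test data).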


4  Data  Reduction  and  Projection  as  Preprocessing 

4.1  Data  Projection  for  Bacterial  Test  Database 

From the viewpoint of table processing, data cleaning can be viewed as cleaning
in the direction of records, that is, in the direction of rows. Data reduction, on
the other hand, can be viewed as cleaning in the direction of attributes, that is,
in the direction of columns. Although data cleaning is a time-consuming process,
it takes much more time to reduce and project data in clinical databases, owing to
the characteristics of the biological sciences, including medicine. Biology
traditionally uses large-scale classification systems. For example, let us consider
the classification of bacteria. In the classification tree of bacteria, millions of
bacteria are classified neatly in a single tree. However, the number of bacteria on
which we want to focus in medicine is very small compared with this total
classification. Even so, the number of bacteria used in bacterial tests is still
too large compared with the number of classes typically handled by data mining
techniques. In the bacterial test database, 1194 kinds of bacteria and other
bioorganisms are used as target classes. If conventional classification or
statistical methods are applied to these data, most of the induced rules or patterns
may be useless, because these methods tend to extract knowledge for differentiating
between the 1194 classes. Such results are not what medical experts expect; they
want more generalized results.

Thus, generalization of such over-classified attributes is required to discover
rules which are easy for medical experts to interpret. For the bacterial test
database, a simple concept hierarchy was used for generalization of values. All the
bioorganisms are classified into 1124 bacteria and 70 non-bacterial bioorganisms.
Then, the 1124 bacteria are classified into aerobic (973) and anaerobic (151). For
the levels below, which are not shown here, the conventional bacterial
classification system is used to construct the tree, in which five levels of
hierarchy are implemented in total. For each data mining task, we decide which
hierarchical level is suitable for data analysis. Although data projection for this
dataset proceeds in parallel with the data mining process, we show how much time is
needed for each data projection in Table 3.

The first column denotes the level of the concept hierarchy used for
generalization. The second one shows the total number of values for comparison.
The third column gives the total number of values for generalization. The fourth
column gives the computational time for generalization. Finally, the fifth column
gives the time for construction of a hierarchical tree by interaction between data
analyzers and domain experts. Since the construction of a hierarchical tree can be
viewed as a knowledge acquisition process, it cannot be automated. In addition to
this hierarchical tree, another hierarchical tree is also needed for generalization
of this dataset: a classification tree for the chronic diseases from which each
patient suffers. This process also takes 7 days to complete the tree structure with
domain experts.

Table 3. Generalization of Bacteria

Projection Level   Generalized Values   Time (Computation)   Time (Construction)
2 (Bacteria)       2                    24 hours             1.0 Days
3                  5                    25 hours             1.4 Days
4                  52                   26 hours             3.0 Days
5                  175                  27 hours             7.0 Days
6                  1194                 0                    -
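
Generalization along a concept hierarchy, as used in this section, can be sketched in Python as follows. The hierarchy below is a toy example: the leaf names (`aerobe_x`, `anaerobe_y`, `other_z`), the `PARENT` table and the function names are hypothetical, and the numbering of levels from the root (level 1) is an assumption chosen to match the level numbers in Table 3.

```python
from typing import Dict, List

# Hypothetical parent links: child value -> parent value.
# The leaf names are placeholders, not values from the hospital database.
PARENT: Dict[str, str] = {
    "bacteria": "bioorganism",
    "non-bacteria": "bioorganism",
    "aerobic": "bacteria",
    "anaerobic": "bacteria",
    "aerobe_x": "aerobic",
    "anaerobe_y": "anaerobic",
    "other_z": "non-bacteria",
}


def path_to_root(value: str) -> List[str]:
    """Return the chain [root, ..., value] for a value in the hierarchy."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain[::-1]


def generalize(value: str, level: int) -> str:
    """Replace a value by its ancestor at the given level (root = level 1).

    Values that already sit at or above the requested level are unchanged.
    """
    chain = path_to_root(value)
    return chain[min(level, len(chain)) - 1]
```

Choosing a smaller `level` yields fewer, coarser target classes, which corresponds to selecting a projection level for a given data mining task.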

4.2  Data  Reduction  for  Steroid  Side-Effect  Database 

Characteristics of Medical Temporal Databases. Since incorporating temporal
aspects into databases is still an ongoing research issue in the database area
(Abiteboul et al., 1995), temporal data are stored as tables with time stamps in
hospital information systems (HIS). The characteristics of medical temporal
data are as follows (Tsumoto, 1999): (1) the number of attributes is too large,
(2) the temporal intervals are irregular, and (3) many values are missing.


Data Reduction using Moving Average Method. How to deal with medical
temporal databases is discussed in (Tsumoto, 1999). Tsumoto introduces an
extended moving average method, which automatically sets up the scale of the
temporal interval. For example, if the scale factor is set to 2, then the temporal
intervals for the moving average are 2, 4, 8, 16, and so on. Each temporal
interval is called a window. In general, let s and y_i denote a scale factor and
a value of a laboratory test, respectively. Then the moving average of y is
defined as:


$$\hat{y}_w = \frac{1}{w} \sum_{i=1}^{w} y_i, \qquad w = s^n$$


where n denotes an integer which is set up for the temporal interval. Thus, s^n
gives the size of the window. One of the disadvantages of the moving average method
is that it cannot deal with categorical attributes. To solve this problem, we
classify categorical attributes into three types, whose information should be given
by users. The first type is constant, which does not change during the follow-up
period. The second type is ranking, which is used to rank the status of a patient.
The third type is variable, which changes over time but for which ranking is not
useful. For the first type, the extended moving average method is not applied.
For the second one, an integer is assigned to each rank and the extended moving
average method for continuous attributes is applied. For the third one, the
temporal behavior of the attribute is transformed into statistics by using
frequencies.

For  further  discussion  on  data  reduction  of  temporal  data,  the  readers  may 
refer  to  (Tsumoto,  1999). 
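
To make the idea of windows of size s^n concrete, the following Python sketch averages a series over windows of size s, s^2, s^3, and so on, and summarizes a "variable" categorical attribute by relative frequencies. It is only an illustration under stated assumptions: windows are taken as non-overlapping consecutive blocks, an incomplete final block is dropped, and the function names are hypothetical. The method in (Tsumoto, 1999) may differ in these details.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def window_sizes(s: int, n_values: int) -> List[int]:
    """Window sizes s, s^2, s^3, ... that fit into a series of n_values points."""
    sizes = []
    w = s
    while w <= n_values:
        sizes.append(w)
        w *= s
    return sizes


def extended_moving_average(y: Sequence[float], s: int = 2) -> Dict[int, List[float]]:
    """For each window size w = s^n, average consecutive blocks of w values.

    Assumption: blocks do not overlap and an incomplete last block is dropped.
    """
    if s < 2:
        raise ValueError("scale factor must be at least 2")
    result = {}
    for w in window_sizes(s, len(y)):
        blocks = [y[i:i + w] for i in range(0, len(y) - w + 1, w)]
        result[w] = [sum(block) / w for block in blocks]
    return result


def frequency_summary(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Summarize a 'variable' categorical attribute by relative frequencies."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}
```

A ranking attribute would first be mapped to integers and then passed to `extended_moving_average`, while a constant attribute would be left untouched.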


Table 4. Computational Time for Data Summarization

Window Size     Computational Time
2^7 (= 128)     12.0 hours
2^8 (= 256)     8.0 hours
∞               7.0 hours


Results of Data Reduction for Steroid Side Effects. Steroid side-effects
are known as long-term side-effects, usually observed when a patient takes steroids
for several years or more. Thus, to capture long-term effects, the window
size is set to 2^7 (= 128) and 2^8 (= 256). It is true that a significant amount of
temporal information is lost through data reduction, but we should remember
that the first objective of data mining is to find simple, useful and unexpected
patterns in data. As discussed at the beginning of Section 4.2, medical temporal
data suffer from many types of irregularities. Table 4 shows the computational time
required for data summarization. It is notable that this table shows a trade-off
between the window size and the computational time: the smaller the window size,
the larger the computational time.


4.3  Total  Time  Required  for  Data  Reduction 

In summary, Table 5 gives the total time required for data reduction and
projection for each dataset, including the knowledge acquisition process. The second
column gives the type of preprocessing. The third column shows the total time
required for each process. Finally, the fourth column shows the time required for
the acquisition of knowledge from domain experts.


Table 5. Total Time Required for Data Reduction

Dataset         Preprocessing   Total Time   Time for Acquisition
Bacterial Test  Projection      15.25 Days   7.0 Days
Side-Effect     Summarization   2.3 Days     0


This table suggests that the generalization of attribute values is a
time-consuming process, especially when domain knowledge has to be supplied by
experts.

5 Rule Induction as Data Mining

After the third step, rule induction based on the rough set model (Pawlak, 1991)
was applied to the two medical datasets. Tsumoto (1998, 2000) extends
rough-set-based rule induction methods to the probabilistic domain.

The details of these methods are omitted here owing to space limitations. For
further discussion, readers may refer to (Tsumoto, 1998; Polkowski and Skowron,
1998). The algorithm introduced in (Tsumoto, 1999) was implemented on a
Sun SPARCstation and was applied to the above two medical databases, the
results of which are summarized in Table 6. For rule induction, the thresholds
for accuracy and coverage are both set to 0.5.
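
The paper does not restate how accuracy and coverage are computed. The Python sketch below assumes the definitions commonly used in rough-set-based rule induction for a rule R → D: accuracy is the fraction of records matching R that also belong to D, and coverage is the fraction of records in D that match R. The function names and the use of "greater than or equal to" for the thresholds are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def accuracy_and_coverage(
    records: List[Record],
    condition: Callable[[Record], bool],
    decision: Callable[[Record], bool],
) -> Tuple[float, float]:
    """Return (accuracy, coverage) of the rule 'condition -> decision'."""
    matched = [r for r in records if condition(r)]
    positives = [r for r in records if decision(r)]
    both = sum(1 for r in matched if decision(r))
    accuracy = both / len(matched) if matched else 0.0
    coverage = both / len(positives) if positives else 0.0
    return accuracy, coverage


def passes_thresholds(
    accuracy: float, coverage: float, min_accuracy: float = 0.5, min_coverage: float = 0.5
) -> bool:
    """Keep a rule only if both measures reach their thresholds (assumed >=)."""
    return accuracy >= min_accuracy and coverage >= min_coverage
```

With both thresholds at 0.5, a rule is kept only if at least half of the records it matches belong to the target class and it matches at least half of that class.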

For the bacterial test database, six attributes for which domain experts want to
find simple patterns are assigned as decision attributes. For the side-effect
database, one attribute (side-effect) is assigned as the decision attribute. Table 6
summarizes the results of data mining. The second and third columns give
information about the data. The fourth column shows the number of rules induced by
the rule induction methods. Finally, the fifth column presents the computational
time required. It is notable that this computational time is rather small compared
with the computational time required for preprocessing.


Table 6. Summary of Data Mining

Data            Size              Attributes   Rules    Computational Time
Bacterial Test  101,343 (3GB)     254          24,335   60 hours (2.5 Days)
Side-Effects    31,119 (1.5GB)    287          14,715   18 hours (0.75 Days)


6  Interpretation  of  Induced  Rules 

After  the  data  mining  step,  we  obtain  many  rules  to  be  interpreted  by  medical 
experts.  Even  if  the  amount  of  information  is  very  small  compared  with  the 
original  databases,  it  still  takes  about  one  week  to  evaluate  all  the  induced  rules. 

6.1  Induced  Rules  of  Bacterial  Tests 

Of  24,335  rules,  only  114  rules  are  unexpected  or  interesting  to  medical  experts. 
From  these  discovered  results,  nine  rules  are  shown  below. 

1. β-lactamase(+) → Bacteria_Detection(+)

2. β-lactamase(3+) → Bacteria_Detection(+)

These two results are interesting from the viewpoint of the history of bacterial
infection. Since penicillin was introduced as an antibiotic, many bacteria have
acquired the ability to produce enzymes that decompose penicillin, called
β-lactamases. The above two results show that such penicillin-resistant bacteria
can be found more easily than penicillin-sensitive ones.

3. Pneumonia → Bacteria_Detection(-)

4. Fever (BT>39) → Bacteria_Detection(-)

5. Malignant Tumor → Bacteria_Detection(-)

These three results were unexpected by medical experts. As for the third rule, it
is well known that bacterial infection is the main cause of pneumonia. However,
even if pneumonia is caused by bacterial infection, it may be difficult to
detect the bacteria. Concerning the fourth rule, high fever suggests that the degree
of infection is high. However, it may be difficult to detect bacteria even if the
degree is high. Finally, when a patient suffers from a malignant tumor,
he or she may suffer from a severe infection due to immunological insufficiency.
However, it may be difficult to detect bacteria in this case as well.

6.2  Induced  Results  of  Steroid  Side-Effects 

Of 14,715 rules, only 106 rules are unexpected or interesting to medical experts.
From these discovered results, four rules are shown below. For simplicity, these
rules are given in a summarized form, though they are originally represented
as conjunctions of temporal attributes.

1. [Steroid>3.0years] & [Headache(+)>0.5] → Glaucoma

This  rule  shows  that  headache  is  an  important  sign  for  glaucoma  due  to 
steroid  side-effect. 

2. [Steroid>2.5years] & [Blurred Vision(+)>0.75] → Cataracta

This result shows that if a patient takes steroids for more than two years, side
effects may be frequently observed. These unexpected or interesting rules have
also been fed back to the university hospital that provided the dataset, and
the hospital staff is evaluating them.

7  Discussion:  Time  Required  for  KDD  Process 

After the data interpretation phase, about one percent of the induced rules were
found to be interesting or unexpected to medical experts. In this section, the total
KDD process is reviewed with respect to computational time.

Table 7 shows the total time required for the KDD process. Each column corresponds
to one of the two datasets, and each row gives the time in days required for one
process. In total, it takes about one month to complete the whole KDD process for
the bacterial test database and about three weeks for the side-effect database.
It is notable that more than 60% of the time is devoted to the three preprocessing
steps: data selection, cleaning and reduction. In particular, for the bacterial
test dataset, 22.75 days (70.5%) are used for these three steps. This is because
domain knowledge has to be acquired for the generalization of data, as discussed
in Section 4. On the other hand, only 4 to 8 percent of the total time is spent on
the data mining process. In the case of the bacterial test database, only 2.5
days (8%) are used for rule induction. Therefore, these empirical results suggest
that the main part of the KDD process is preprocessing rather than the discovery
of patterns from data.
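
The shares quoted in this section can be recomputed from the figures in Table 7. The short Python sketch below does only this arithmetic; the dictionary layout and the function name `shares` are illustrative choices, while the numbers are the day counts reported in the table.

```python
# Days per KDD step, copied from Table 7.
TIMES = {
    "Bacterial Test": {"selection": 2.3, "cleaning": 5.2, "reduction": 15.25,
                       "mining": 2.5, "interpretation": 7.0},
    "Side-Effect": {"selection": 7.3, "cleaning": 2.5, "reduction": 2.3,
                    "mining": 0.75, "interpretation": 7.0},
}
PREPROCESSING = ("selection", "cleaning", "reduction")


def shares(steps):
    """Return (total days, preprocessing share, data mining share)."""
    total = sum(steps.values())
    preprocessing = sum(steps[name] for name in PREPROCESSING)
    return total, preprocessing / total, steps["mining"] / total


for name, steps in TIMES.items():
    total, pre_share, mining_share = shares(steps)
    print(f"{name}: total {total:.2f} days, "
          f"preprocessing {pre_share:.1%}, mining {mining_share:.1%}")
```

For both datasets the preprocessing share comes out above 60% and the data mining share below 10%, which is the comparison the discussion relies on.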

Another important point is that data interpretation is also a time-consuming
process compared with data mining, because it requires interpretation by
domain experts. In summary, if we want to make the KDD process faster,
we should consider automating the processes that need interaction between
computers and domain experts. In particular, if domain knowledge could be easily
incorporated into the program, the time for the third step might be significantly
reduced. Therefore, more intensive research on the automation of
data preprocessing is required.


Table 7. Total Time Required for KDD process (in days)

KDD Process          Bacterial Test   Side-Effect
Data Selection       2.3              7.3
Data Cleaning        5.2              2.5
Data Reduction       15.25            2.3
Data Mining          2.5              0.75
Data Interpretation  7.0              7.0
Total Time           32.25            19.85

Abstract. Empirical equations and rules are important classes of
regularities that can be discovered in databases. We concentrate on their role
as definitions of attribute values. Such definitions can be used in many
ways in a single database and for transfer of knowledge between databases.
We analyze quests for definitions of an attribute in a given database.
A quest triggers a discovery mechanism that specializes in recursively
searching a system of databases and returns a set of partial definitions.
We introduce the notion of shared operational semantics founded on
an equation-based and rule-based system of partial definitions. It gives
the necessary foundations for designing local query answering systems in a
distributed knowledge system (DKS).


1  Shared  Semantics  for  Distributed  Autonomous  DBs 

In many fields, such as medical, manufacturing, banking, military and
educational, similar databases are kept at many sites. Each database stores
information about local events and uses attributes suitable for locally collected
information, but since the local situations are similar, the majority of attributes
are compatible among databases. Yet, an attribute may be missing in one database,
while it occurs in many others. For instance, different military units may apply
the same battery of personality tests, but some of these tests may not be used in
one unit or another.

Missing attributes lead to problems. A recruiter new at a given unit may
query a local database $S_1$ to find candidates who match a desired description,
only to realize that one component $a_1$ of that description is missing in $S_1$,
so that the query cannot be answered. The same query would work in other databases,
but the recruiter is interested in identifying suitable candidates in $S_1$.

1.1  System  Architecture 

Operational semantics introduced in [15] provides definitions of missing
attributes through a search for definitions in many databases. Figure 1 shows the
architecture of a distributed knowledge system. The discovery layer for each
database is initially formed from rules and equations extracted from that database.
They


Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI  1932,  pp.  582-590,  2000. 
© Springer-Verlag Berlin Heidelberg 2000
define some of the attributes by other attributes in the same database and are
discovered by an automated process. They are used for knowledge exchange
between databases and jointly form an integrated semantics for a distributed
knowledge system, which defines the meaning of queries. The query answering
system QAS uses definitions extracted at other databases and/or available in the
local discovery layer to answer queries which otherwise would not be reachable.

Fig. 1. Distributed Knowledge System


1.2  Links  to  Previous  Research 

QAS is a natural knowledge-discovery-based extension of the query answering
system for a system of databases presented in [11], [12], [13]. In these papers,
rules discovered in one database define values of missing attributes in other
databases. The search for rules can use many strategies which find rules describing
decision attributes in terms of classification attributes. It has been used in
conjunction with systems such as LERS (developed by J. Grzymala-Busse [3]) or AQ15
(developed by R. Michalski and his collaborators [8]).

The  task  of  integrating  established  database  systems  can  be  complicated  not 
only  by  the  differences  between  the  sets  of  attributes  but  also  by  differences  in 
structure  and  semantics  of  data.  The  notion  of  an  intermediate  model,  proposed 
by  Wiederhold  [7],  is  very  useful  in  dealing  with  such  a  problem,  because  it 
describes  the  database  content  at  a  relatively  high  abstract  level,  sufficient  for  a 
homogeneous  representation  of  all  databases.  In  our  paper  a  discovery  layer  can 
be  seen  as  an  application  of  the  ideas  of  an  intermediate  model  for  a  distributed 
DB  system. 

1.3  Operational  Definition 

Definitions that are used to compute attribute values of objects are often
called operational definitions. They are common in science, where values of each
attribute are determined in many ways, depending on different applications.
Operational semantics was introduced by Bridgman [1] and developed by
Carnap [2] and many others, including the semantics of coherent sets of operational
definitions developed by Zytkow [16] and applied in robotic experiments [18].

Operational semantics can be applied to databases. Many computational
mechanisms can be used to define values of an attribute. We call them operational
definitions because each is a mechanism by which the values of a defined
attribute can be computed. Many are partial definitions, as they apply to subsets of
records that match the “if” part of a definition. In 1989-1990, Ras et al. [6], [14]
introduced a mechanism which first seeks and then applies as definitions rules in
the form “If Boolean-expression(x) then a(x)=w”, which are partial definitions
of attribute a applicable to all objects x that satisfy Boolean-expression(x).
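
A rule of the form “If Boolean-expression(x) then a(x)=w” can be represented directly as a pair of a predicate and a value. The Python sketch below shows how a set of such partial definitions might be applied to one record. The representation, the function name and the decision to return no value when applicable rules disagree are assumptions for illustration, not the mechanism of [6], [14].

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Record = Dict[str, object]
# A partial definition: (Boolean expression over a record, value w assigned to a).
Rule = Tuple[Callable[[Record], bool], Hashable]


def apply_partial_definitions(record: Record, rules: List[Rule]) -> Optional[Hashable]:
    """Return the value of the defined attribute for this record, if any.

    Returns None when no rule applies, or when the applicable rules
    propose different values (conflict handling is an assumption here).
    """
    proposed = {value for condition, value in rules if condition(record)}
    return proposed.pop() if len(proposed) == 1 else None
```

Because each rule covers only the records that satisfy its condition, some records receive no value, which is what makes the definitions partial.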

The 49er system can find knowledge in many forms, including equations,
that can be used to define one attribute by other attributes in a relational table.
We conducted experiments using this mechanism in addition to rule-based
definitions [4]. The growing interest in KDD will make the discovery of operational
definitions increasingly popular. Recently, Prodromidis & Stolfo [10] argued that
attribute definitions are a useful target for discovery in databases.

1.4  Shared  Semantics  in  Action:  Query  Answering 

Many query-answering situations can benefit from the following generic scenario.
A query q is issued at database $S_1$, but it is “unreachable” in $S_1$ because it
uses an attribute a which is missing in $S_1$. A request for a definition of a is
issued to other sites in the distributed autonomous database system. The request
specifies attributes $a_1, \ldots, a_n$ available at $S_1$. When attribute a and a
subset $\{a_{i_1}, \ldots, a_{i_k}\}$ of $\{a_1, \ldots, a_n\}$ are available in
another database $S_2$, a discovery mechanism is invoked to search for operational
definitions at $S_2$, by which values of a can be computed from values of some of
$a_{i_1}, \ldots, a_{i_k}$. If discovered, such a definition is returned to the
discovery layer over $S_1$ and used to compute the unknown values of a that occur
in query q.

The same mechanism can apply if attribute a is available at $S_1$, but some
values of a are missing. In that case, the discovery mechanism can be applied at
$S_1$, if the number of the available values of a is sufficiently large.


1.5  Other  Applications 

Functional dependencies in the form of equations are a succinct, convenient form
of knowledge, useful in many ways. The equation $a = f(a_1, a_2, \ldots, a_m)$
can be directly used to predict values $a(x)$ of a for object x by substituting
the values of $a_1(x), a_2(x), \ldots, a_m(x)$ if all are available. If some are
not directly available, they may be predicted by other operational definitions.

When  we  suspect  that  some  values  of  a  may  be  wrong,  an  equation  imported 
from  another  database  may  be  used  to  verify  them.  An  equation  acquired  at  the 
same  database  may  be  used,  too.  For  instance,  patterns  discovered  in  clean  data 
can  be  applied  to  distinguish  wrong  values  in  the  raw  data. 
Equations that are generated at different sites can be used to cross-check the
consistency of knowledge and data coming from different databases. If the values
of a that are computed by two independent equations are approximately equal,
this confirms consistency of both definitions.
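The cross-check can be illustrated with a minimal sketch. The two equations, the record fields `a1` and `a2`, and the 5% relative tolerance are assumptions chosen for the example; the text does not prescribe a specific notion of “approximately equal”.

```python
import math
from typing import Callable, Dict, List

Record = Dict[str, float]
Equation = Callable[[Record], float]

def consistent(pred_1: float, pred_2: float, rel_tol: float = 0.05) -> bool:
    """True when two independently predicted values of a agree within rel_tol."""
    return math.isclose(pred_1, pred_2, rel_tol=rel_tol)

def cross_check(records: List[Record], eq_1: Equation, eq_2: Equation,
                rel_tol: float = 0.05) -> List[Record]:
    """Return the records on which the two equations disagree."""
    return [r for r in records if not consistent(eq_1(r), eq_2(r), rel_tol)]

# Hypothetical equations for a, as if imported from two different sites.
eq_site_2: Equation = lambda r: 2.0 * r["a1"]
eq_site_3: Equation = lambda r: r["a2"] + 1.0

records = [{"a1": 1.0, "a2": 1.0}, {"a1": 2.0, "a2": 5.0}]
disagreements = cross_check(records, eq_site_2, eq_site_3)
```

An empty result supports the consistency of both definitions on the given data, while any returned record marks a place where the data or one of the equations needs further analysis.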

All equations by which values of a can be computed expand the understanding
of a. Attribute understanding is often initially inadequate when we receive a
new dataset for the purpose of data mining. We may know the domain of values
of a, but we do not understand a’s detailed meaning, so it is difficult to apply
background knowledge and the knowledge discovered about a. In such cases, an
equation that relates a poorly understood attribute a to attributes of known
meaning explains some of the meaning of a.

2  Recursive  Search  for  Equations 

Let us present in algorithmic detail a recursive discovery mechanism that
supports global query answering. When an attribute a is needed but unreachable
in database $S_1$, a request for a definition of a is issued to other sites in
the distributed database system. The request specifies the attributes
$a_1, \ldots, a_n$ available in $S_1$, because only those attributes can be
included in a definition useful at $S_1$.

In this section we present a recursive algorithm that searches for equations
and we analyze an application of this algorithm. The same algorithm can also be
used to search for rules. In Section 3 we will present an example of recursive
search for rules.

When the attribute a and a subset $\{a_{i_1}, \ldots, a_{i_k}\}$ of
$\{a_1, \ldots, a_n\}$ are available in another database $S_2$, 49er’s discovery
mechanism is invoked to search $S_2$ for equations by which values of a can be
computed from values of some of $a_{i_1}, \ldots, a_{i_k}$. If discovered, such
equations are returned to $S_1$ and can be used in numerous ways.

In [15] we considered a computational mechanism that searches each database
individually for equations suitable for the role of definitions of a. But there
are numerous situations in which this mechanism must be expanded and applied
recursively.

2.1  Non-overlapping  Attribute  Sets 

First, there may be no database which contains a and any of
$\{a_1, \ldots, a_n\}$. This can be illustrated with the following example of
simple relation schemas, one relation per database:

$S_1(a_1, a_2, \ldots, a_n)$ ; definition of attribute a is sought
$S_2(a, b_1, \ldots, b_k)$
$S_3(b_1, a_2, a_3)$

Suppose that an equation $a = f(b_1)$ has been discovered in $S_2$. It cannot be
used in $S_1$, because $b_1$ is unavailable. But $S_3$ includes $b_1$ and some
of $\{a_1, \ldots, a_n\}$. An equation $b_1 = f_1(a_2, a_3)$ may be discovered
that defines $b_1$ in terms of $a_2$ and $a_3$. That equation can be substituted
into $a = f(b_1)$, leading to the equation $a = f(f_1(a_2, a_3))$ that can be
applied in $S_1$.
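The substitution step amounts to function composition. The sketch below assumes two made-up equations standing in for whatever a discovery system would return at the two remote sites; only the composition pattern reflects the text.

```python
from typing import Callable

def compose(f: Callable[[float], float],
            f1: Callable[[float, float], float]) -> Callable[[float, float], float]:
    """Substitute b1 = f1(a2, a3) into a = f(b1), giving a = f(f1(a2, a3))."""
    return lambda a2, a3: f(f1(a2, a3))

# Hypothetical equations, as if discovered at S2 and S3.
f = lambda b1: 3.0 * b1 + 1.0   # a  = f(b1),      found in S2
f1 = lambda a2, a3: a2 * a3     # b1 = f1(a2, a3), found in S3

# The composed definition uses only a2 and a3, which are stored at S1.
a_from_local = compose(f, f1)
```

The composed function never touches the unavailable attribute, so it can be evaluated entirely on records of the requesting database.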
2.2  Search for a Sufficient Fit

Second, there may be a database $S_4$ that includes a and some of
$\{a_1, \ldots, a_n\}$. But no equation that defines a through any of
$\{a_1, \ldots, a_n\}$ has a fit sufficient to play the role of a definition. In
this situation, the search for a definition can be expanded. Perhaps an equation
is discovered that has a sufficient fit to play the role of a definition, but in
addition to some of $\{a_1, \ldots, a_n\}$ it uses $b_1$, unavailable in $S_1$.
We already discussed the steps appropriate for this situation.


2.3  Empirical  Contents  in  a  Set  of  Definitions 

There is a more systematic reason why the search for equations should continue,
even if it has been successful. Equations that are used to compute missing values
are empirical generalizations. Although they may be reliable, we cannot trust
them unconditionally, and it is good practice to seek their further verification,
especially if they are applied to an expanded range of values of a. The
verification may come from additional knowledge that can be used as alternative
definitions. Ras et al. [12], [13] used rules coming from various sites and
verified their consistency.

Multiple equations give a chance for cross-verification, as their predictions
can be compared. Each consistent prediction provides extra justification for the
system of definitions, while each inconsistency calls for further empirical
analysis of data and definition improvements.


2.4  Recursive  Discovery  Algorithm 

The following algorithm can be used to search recursively for an attribute
definition:

Algorithm: Find definitions of attribute a that are applicable in DB

    Let A be the set of attributes in DB
    For each database X
        Let A_X be the set of attributes in both A and X
        If a is available in X and A_X is not empty, then
            seek all definitions of a in A_X; push them on def(a,A)
    If def(a,A) is non-empty, then HALT
    For each X
        If a is available in X, then
            seek all definitions of a in X; push them on def(a,X)
            For each definition DEF in def(a,X)
                For each attribute b in DEF that is missing in DB
                    Find definitions of attribute b that are applicable in DB
                If all attributes in DEF are defined by attributes in A
                    then add DEF to list of definitions of a
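A minimal Python sketch of this recursion is given below. It is not the authors' implementation. The discovery engine is replaced by a hypothetical lookup table `known`, a definition is represented only by the set of attributes it uses, and a cycle guard is added that the pseudocode does not mention.

```python
from typing import Dict, FrozenSet, List, Tuple

# known[(site, attribute)] lists the attribute sets from which `attribute`
# can be computed at `site`. It stands in for a discovery engine such as
# 49er and is a hypothetical structure used only for this sketch.
Known = Dict[Tuple[str, str], List[FrozenSet[str]]]

def find_definitions(target: str, local: FrozenSet[str], known: Known,
                     pending: FrozenSet[str] = frozenset()) -> List[FrozenSet[str]]:
    """Return sets of local attributes from which `target` can be computed."""
    if target in pending:        # cycle guard (not part of the pseudocode)
        return []
    pending = pending | {target}
    candidates = [used for (_, attr), defs in known.items()
                  if attr == target for used in defs]
    # First pass: definitions that use only locally available attributes.
    direct = [used for used in candidates if used <= local]
    if direct:
        return direct            # corresponds to HALT in the pseudocode
    # Second pass: resolve every missing attribute recursively.
    results: List[FrozenSet[str]] = []
    for used in candidates:
        resolved = set(used & local)
        for missing in used - local:
            sub = find_definitions(missing, local, known, pending)
            if not sub:
                break            # this definition cannot be grounded locally
            resolved |= sub[0]   # take the first definition found
        else:
            results.append(frozenset(resolved))
    return results

local_attrs = frozenset({"a1", "a2", "a3"})
known_defs: Known = {
    ("S2", "a"): [frozenset({"a1", "b"})],   # a = f2(a1, b)
    ("S4", "b"): [frozenset({"a2"})],        # b = f3(a2)
}
found = find_definitions("a", local_attrs, known_defs)
```

With these hypothetical entries, attribute a is first defined through a1 and b, and b is then resolved through a2, so the returned definition depends only on attributes stored locally.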
2.5  An Example of Recursive Search for a Definition

Consider the following database schemas

$S_1(a_1, a_2, a_3)$,
$S_2(a, a_1, b)$,
$S_3(a, b, a_3)$,
$S_4(a_2, b)$

also illustrated in Figure 2. A discovery layer is assigned to each of these
four databases and contains definitions of attributes and/or attribute values
extracted from them. A definition of attribute a is sought.


Fig. 2. Search for equations in support of query answering; an example
(diagram not reproduced; its discovery-layer labels read $a = f_1(a_3)$,
$a = f_2(a_1, b)$ and $b = f_3(a_2)$)


The  recursive  search  for  a  definition  follows  these  steps: 

1. $S_1$ sends a request: define a, use $a_1, a_2, a_3$
2. $S_2$ and $S_3$ try to answer.
3a. Situation 1: definition found in $S_3$: $a = f_1(a_3)$
3b. Situation 2: no definition found, so the search is expanded to additional
attributes: define a, use $a_1, a_2, a_3$ and any other parameters available.
4. $S_2$ tries to answer.
5a. Situation 3: definition not found so the search halts.
5b. Situation 4: definition found in $S_2$: $a = f_2(a_1, b)$
6. A new quest is issued: define b, use $a_1, a_2, a_3$
7. $S_3$ and $S_4$ try to answer.
8a. Situation 5: definition not found so the search halts.
8b. Situation 6: definition found in $S_4$: $b = f_3(a_2)$
9. Equation in 8b is substituted into equation in 5b: $a = f_2(a_1, f_3(a_2))$
10. The search halts.
3  Query Answering System Based on Reducts

In this section we recall the notion of a reduct and show how it can be used to
improve the query answering process in distributed autonomous database systems
(DADS). We assume that information stored in all databases is consistent.

Let us assume that $S = S(A)$ is a database schema and $S(X, A)$ represents
its view. Each $a \in A$ is interpreted here as a function $a : X \to Dom(a)$,
where $Dom(a)$ is the domain of a. For simplicity we assume that $Dom(a)$,
$Dom(b)$ are disjoint for any $a, b \in A$ such that $a \neq b$.

Let $B \subset A$. We say that $x, y \in X$ are indiscernible by B in S, denoted
$[x \sim_B y]$, if $(\forall a \in B)[a(x) = a(y)]$.


Fig.  3.  Process  of  resolving  a  query  by  QAS  in  DADS 
Now, assume that both $B_1, B_2$ are subsets of A. We say that $B_1$ depends
on $B_2$ if $\sim_{B_2} \subseteq \sim_{B_1}$, that is, if any two objects
indiscernible by $B_2$ are also indiscernible by $B_1$. Also, we say that $B_2$
is a reduct of $B_1$ ($B_1$-reduct) if $B_1$ depends on $B_2$ and $B_2$ is
minimal. If B is a singleton set ($B = \{f\}$) then instead of B-reduct we say
f-reduct.
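The notions of indiscernibility, dependency and reduct can be made concrete with a small brute-force sketch. It assumes the standard rough-set reading of dependency (objects indiscernible by the second set are indiscernible by the first) and uses a hypothetical four-row table. The exhaustive search is exponential in the number of candidate attributes and is meant only as an illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, List, Sequence, Set, Tuple

Row = Dict[str, Hashable]

def partition(rows: Sequence[Row], attrs: Iterable[str]) -> Set[FrozenSet[int]]:
    """Indiscernibility classes: row indices agreeing on every attribute in attrs."""
    names = sorted(attrs)
    classes: Dict[Tuple, List[int]] = {}
    for i, row in enumerate(rows):
        classes.setdefault(tuple(row[a] for a in names), []).append(i)
    return {frozenset(c) for c in classes.values()}

def depends(rows: Sequence[Row], b1: Iterable[str], b2: Iterable[str]) -> bool:
    """B1 depends on B2: rows indiscernible by B2 are indiscernible by B1."""
    p1 = partition(rows, b1)
    return all(any(c2 <= c1 for c1 in p1) for c2 in partition(rows, b2))

def reducts(rows: Sequence[Row], target: str,
            candidates: Iterable[str]) -> List[FrozenSet[str]]:
    """All minimal subsets of candidates on which {target} depends."""
    found: List[FrozenSet[str]] = []
    names = sorted(candidates)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            if any(r <= subset for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if depends(rows, {target}, subset):
                found.append(subset)
    return found

# Hypothetical table in which f is determined by a and b together (f = a xor b).
rows = [
    {"a": 0, "b": 0, "f": 0},
    {"a": 0, "b": 1, "f": 1},
    {"a": 1, "b": 0, "f": 1},
    {"a": 1, "b": 1, "f": 0},
]
f_reducts = reducts(rows, "f", ["a", "b"])
```

In this table neither a nor b alone determines f, so the only f-reduct among the candidates is the pair of both attributes.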


Example. Assume the following scenario:

- $S_1 = (X_1, \{c, d, e, g\})$, $S_2 = (X_2, \{a, b, c, d, f\})$,
$S_3 = (X_3, \{b, e, g, h\})$ are views of database schemas $S_1, S_2, S_3$,
respectively.
- User submits a query $q = q(c, e, f)$ to the query answering system QAS
associated with database $S_1$,
- Databases $S_1, S_2, S_3$ form a distributed autonomous database system
DADS.

Attribute f is non-local for database $S_1$, so the query answering system
associated with $S_1$ has to contact other sites of DADS requesting a definition
of f in terms of $\{d, c, e, g\}$. Such a request is denoted by
< f : d, c, e, g >. Assume that the database $S_2$ is contacted. The definition
of f, extracted from $S_2$, involves only attributes
$\{d, c, e, g\} \cap \{a, b, c, d, f\} = \{c, d\}$. There are three f-reducts
(coverings of f) in $S_2$. They are: $\{a, b\}$, $\{a, c\}$, $\{b, c\}$. The
optimal f-reduct is the one which has the minimal number of elements outside
$\{c, d\}$; in our case $\{a, c\}$ and $\{b, c\}$. Let us assume that $\{b, c\}$
is chosen as an optimal f-reduct in $S_2$.
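The choice of an optimal f-reduct in this example can be reproduced with a short sketch. The function name `optimal_reducts` is an assumption made for illustration; the attribute sets are those given in the example.

```python
from typing import FrozenSet, Iterable, List, Set

def optimal_reducts(reducts: Iterable[FrozenSet[str]],
                    local: Set[str]) -> List[FrozenSet[str]]:
    """Keep the reducts with the fewest attributes outside `local`."""
    candidates = list(reducts)
    if not candidates:
        return []
    fewest = min(len(r - local) for r in candidates)
    return [r for r in candidates if len(r - local) == fewest]

requested = {"d", "c", "e", "g"}        # attributes named in the request
s2_attrs = {"a", "b", "c", "d", "f"}    # attributes of S2
shared = requested & s2_attrs           # attributes usable without further search
f_reducts = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
best = optimal_reducts(f_reducts, shared)
```

Each attribute outside the shared set triggers a further request to other sites, so minimizing their number keeps the recursive search short.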

Then, the definition of f in terms of attributes $\{b, c\}$ may be extracted from
$S_2$ and the query answering system of $S_2$ will contact other sites of DADS
requesting a definition of b (which is non-local for $S_1$) in terms of attributes
$\{d, c, e, g\}$. If a definition of b is found, then it is sent to the QAS of
$S_1$. Figure 3 shows the process of resolving query q in the example above.


4  Conclusion 

The discovery layer at each site is formed from partial definitions either
extracted from that site or imported from other sites. All these partial
definitions (if consistent) define a local operational semantics and the meaning
of queries seen by the local query answering system. Discovery processes update
discovery layers associated with all databases in DADS. They do it in real time
whenever a local query cannot be answered with the help of local operational
semantics. As a result of operational definitions discovery, local semantics is
augmented with a relevant selection of definitions found in DADS.
Abstract. The focus of KDD research is being extended from individual
techniques to the KDD process, while KDD systems have been rapidly evolving
from the early stand-alone ones to the current large, Distributed KDD
(DKDD) systems. In this paper, we concentrate on the architectural aspect
of Distributed KDD systems from the perspective of CSCW (treating DKDD as a
special case of CSCW). After summarizing the requirements needed to support
Distributed KDD, we describe a Client/Server architecture for DKDD that is
“traditional” for CSCW in general. Then we propose a Multi-Agent (MAS) based
architecture for DKDD. Compared with the traditional Client/Server
architecture, the MAS based architecture is better in terms of simplicity and
flexibility, and particularly useful in modeling and providing support to
cooperative activities (communication, negotiation, coordination and
collaboration).


1  Introduction 

Data Mining or Knowledge Discovery in Databases (KDD) means discovering
new, useful knowledge (called models in this paper) from the vast amounts of data
accumulated in an organization’s databases. The KDD process is the set of
activities needed to transform raw datasets into usable models. In real-world
applications, the KDD process is an integrated part of a whole business process,
with activities such as data sampling, pre-processing, mining (model building),
model analysis, visualization and integration into the business process. It is
now well recognized that a real-world KDD process can be very complex, similar in
many aspects to the Software Development Process [Hum89].

KDD is essentially a demand-driven field. Although early work in KDD
inevitably concentrated on individual mining techniques, what is really
important is KDD systems that combine various KDD techniques, and their
successful application to real-world databases. KDD systems have rapidly evolved
[Ce99]. While the first generation of KDD systems consisted of stand-alone mining
applications over files, the second generation has been integrated with data
management, and the third (current) generation is characterized by distribution
of data and computation over enterprises’ Intranets or across the global
Internet. We will call this kind of KDD system a Distributed KDD System (DKDD).
Further development of DKDD systems includes dynamically adding new
computational resources to a network, and the mobility of code (for example,
mining components move to DBMS sites and execute within the databases). Thus, we
can view Distributed KDD as a special case of CSCW (Computer-Supported
Cooperative Work), a multidisciplinary research area focusing on effective
methods of sharing information and coordinating activities [Gru94].

In recent years, along with developing new KDD techniques, we have paid
increasing attention to the process and architecture aspects of KDD systems. In
order to increase both the autonomy and versatility of KDD systems, we proposed a
Global Learning Scheme (GLS) [?, ZLO97] as a framework for organizing complex
KDD processes. GLS has two levels: the meta-level of process and the object
level of process. On the meta-level, it provides mechanisms and facilities for
modeling, planning, scheduling, controlling and management of the KDD process; on
the object level, various mining components (methods and algorithms) are grouped
according to the different stages of data mining. Within this framework, we have
been investigating the planning meta process in depth [ZLKO97, LZO97]. In
particular, we propose to handle iterations in the KDD process by integrating
planning and execution [LZO98], and to deal with KDD process changes by
incremental replanning [ZLKO98].

In this paper, we concentrate on the architectural aspect of Distributed KDD
(DKDD) systems from the perspective of CSCW (treating DKDD as a special case
of CSCW). First we summarize the requirements needed to support Distributed
KDD. Then we describe a Client/Server architecture for DKDD that is
“traditional” for CSCW in general and based on the current status of Internet
technology. Then we propose a Multi-Agent (MAS) based architecture for DKDD,
which is adapted from the generic MAS architecture for CSCW [MMP98, WCL99].
Compared with the traditional architecture, the agent-based architecture is
better in terms of simplicity and flexibility, and particularly useful in
modeling and providing support to cooperative activities (communication,
negotiation, coordination and collaboration). The proposed architecture can
model the information flow as well as the process flow in DKDD. Finally, in
the Conclusion section, implementation issues are briefly discussed.

2  Requirements  for  Distributed  KDD  Architecture 

Distributed KDD (such as Enterprise Distributed Data Mining [Ce99]) faces
unique challenges and needs architectural support to cope with them. The
requirements for architectural support to Distributed KDD can be summarized as
follows.

- Multiple roles: Unlike simple, stand-alone, prototype KDD work, a real-world
KDD process involves multiple human roles. We can identify at least three
types of them: the analysts (for KDD task planning and result analysis), the
knowledge engineers (executing the mining tasks), and the end-users (people
managing and optimizing the business process within which the KDD process
occurs). Multiple people may access the data and the analytical results
(the models), so the KDD system must provide multiple access points.
(NB: in this paper, the “user” of KDD systems mainly refers to analysts or
knowledge engineers).

- Mining on data of huge size: Gigabytes or even terabytes of data have been
accumulated in large organizations. Mining data on such a large scale has
the following implications for KDD architecture:

•  We need large computational power (high-performance servers) for mining
tasks, and visualization tools for data analysis and model analysis.

•  The  mining  operation  should  be  run  close  to  the  databases,  because  it 
is  not  practical  to  move  the  vast  data  between  the  sites  of  individual 
analysts.  This  requirement  can  be  supported  either  by  mobile  mining 
components  traveling  to  the  database  sites  and  executing  there,  or  by 
setting  up  high-performance  servers  close  to  databases. 

•  The  user  should  be  allowed  to  browse  and  sample  data  during  planning 
and  editing  his/her  mining  tasks. 

- Mining on diverse and distributed data sources: Because various types of data
are accumulated on many sites in a large organization, a user may need to
access multiple datasets. So the KDD system must support distributed
mining and the combination of partial results into a meaningful total.

- KDD process planning: There are several stages in the KDD process (the three
major stages are pre-processing, model building, and model analysis and
refinement). For each stage, there is a large number of available KDD
techniques and algorithms. Some of them may soon be out of date while new
ones arrive continuously. So, a good combination of KDD techniques and easy
integration of new techniques are very desirable, and this demands careful
planning of the KDD tasks. Note that different kinds of data resources call
for different KDD techniques, so the planning involves browsing and
sampling data.

- Interactions among KDD roles: Because the KDD process is iterative through
the cycle of data selection, pre-processing, model building, and model
analysis and refinement, a high degree of interaction among analysts, knowledge
engineers and the end-users is needed.

- Flexibility: A wide range of configuration options is needed to fulfill the
different needs of large organizations, so that the applications can be scaled
from a few client workstations to high-performance server machines.

- Open-ended-ness for future extension.

- Conceptual and architectural simplicity is important in designing such a
complex system to ensure/enhance its correctness, flexibility and openness, etc.

On the implementation level, the rapid development of the Internet and related
technologies, such as software component technology and various Java/CORBA
packages, does provide solutions for Distributed KDD. But on the design level, we
need conceptual and architectural clarity and simplicity for complex systems like
Distributed KDD systems. This is the focus of the remaining part of this
paper.
3  A Client/Server Architecture for Distributed KDD

We regard Distributed KDD (DKDD) as a special case of Computer-Supported
Cooperative Work (CSCW), and try to apply the generic Client/Server
architecture of CSCW [LC98, WCL99] to DKDD. We also investigate the architectures
of some existing Distributed KDD systems such as [Ce99] (though mainly on
the implementation level). As a result, we can describe here the “traditional”
Client/Server architecture for Distributed KDD systems, as shown in Figure 1.


Fig. 1. A Client/Server Architecture of Distributed KDD Systems
(diagram not reproduced; each client box, from client-1 onward, lists
Browsing/Sampling Data, KDD Process Planning, Model Analysis/Refinement, and
Visualization)


On the server side, there could be multiple KDD Server sites, as well as
multiple DBMS sites and HPC (High-Performance Computation) sites, distributed
globally (that is, they may be located across the whole Internet).


A  Multi-agent  Based  Architecture  595 


Obviously, a DBMS site provides the traditional database service. On the
other hand, an HPC site provides parallel mining algorithms executed on a
dedicated separate parallel machine. The algorithms include traditional ones for
classification, clustering, and association rule discovery, as well as new
techniques such as Inductive Logic Programming (ILP). A full range of statistical
services is also available. As we mentioned before, an HPC site should be set up
close to the Database site from which the data is mined.

Each  KDD  server  consists  of  three  main  components: 

1. Object Manager: Manages persistent objects such as data mining tasks and
results (models) for various users, authenticates users and controls users’
access to the persistent objects.

2. Mining Manager: Provides the interface with an HPC site via CORBA, etc.
This component controls data conversion, data transfer, and parameter passing
to the mining algorithms.

3. Data Cache: Provides the interface with a Database via JDBC. The loaded
and cached data are used for browsing the Database from the client.

On  the  client  side,  there  could  be  multiple  KDD  Client  sites,  also  across 
the  whole  Internet.  KDD  client  sites  communicate  with  each  other  and  with 
KDD  Servers.  Connections  between  KDD  client  sites  and  DBMS/HPC  sites  are 
indirect  via  KDD  Server  sites.  A  KDD  client  has  the  following  components  (the 
first  three  interact  with  KDD  Servers,  the  others  are  for  internal  work): 

- Object Browser: Viewing the user’s objects held in the Object Manager of
the KDD Server.

- KDD Task Invoker: Submitting the planned and edited KDD tasks via the
Mining Manager of the KDD Server, which in turn requests the service of an HPC
site. Meanwhile some simple KDD tasks using only local datasets can also be
executed locally on the client site (see items below).

- Data Browser: Providing the function of browsing remote databases, by
access to the Data Cache of the KDD Server.

- KDD Process Planner: Tries to make a plan for the KDD process. A plan
is a partially ordered network of mining activities needed for pre-processing,
model building, model analysis and refinement. Remote data browsing and
sampling are needed for the planner to select relevant data, to choose
appropriate mining techniques, and to organize them into a KDD process plan.

- Visualization Tool: For large-scale mining tasks, visualization is necessary
for data analysis as well as model analysis.

- Work Space: The place where the client works. A group of KDD clients may
have shared work spaces.
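The plan produced by the KDD Process Planner is described above as a partially ordered network of mining activities. As a rough illustration, such a plan can be stored as a mapping from each activity to its prerequisites and linearized with a topological sort. The activity names and the dependency structure below are hypothetical, and the sketch uses Python's standard `graphlib` module (Python 3.9 or later) rather than any planner from the paper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical plan: each activity maps to the activities that must precede it.
plan: Dict[str, Set[str]] = {
    "data_sampling": set(),
    "pre_processing": {"data_sampling"},
    "model_building": {"pre_processing"},
    "model_analysis": {"model_building"},
    "visualization": {"model_building"},
    "refinement": {"model_analysis"},
}

def execution_order(plan: Dict[str, Set[str]]) -> List[str]:
    """Return one linear order of activities consistent with the partial order."""
    return list(TopologicalSorter(plan).static_order())

order = execution_order(plan)
```

Activities that are not ordered relative to each other, such as model analysis and visualization here, may appear in either order and could also be executed in parallel.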

There  is  a  special,  coordinating  KDD  client  (for  the  manager  of  the  data 
mining  group  in  an  organization,  for  example)  with  the  following  functionalities: 

- Cooperative Planning: Partitioning the overall KDD tasks among a group of
knowledge engineers (KDD clients).

- Scheduling: Distributing KDD tasks planned and submitted by other KDD
clients to appropriate KDD servers. The selection of KDD servers is based
on some resource allocation policy (for example, to execute the mining
operations in a KDD server that contains the data).

- Synthesis: Combining partial results from individual clients into a meaningful
total.
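The scheduling function of the coordinating client can be sketched as follows, using the data-locality policy given as an example above. The server and dataset names are hypothetical, and the least-loaded tie-break is an added assumption rather than something stated in the text.

```python
from typing import Dict, List, Set

def schedule(tasks: Dict[str, str],
             servers: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Assign each task to a KDD server that holds the dataset it mines.

    tasks maps task name -> dataset name; servers maps server name -> datasets
    held. Among eligible servers the least-loaded one is chosen (an
    illustrative tie-break, not part of the described policy).
    """
    assignment: Dict[str, List[str]] = {name: [] for name in servers}
    for task, dataset in tasks.items():
        eligible = [s for s, held in servers.items() if dataset in held]
        if not eligible:
            raise ValueError(f"no KDD server holds dataset {dataset!r}")
        target = min(eligible, key=lambda s: len(assignment[s]))
        assignment[target].append(task)
    return assignment

servers = {"kdd1": {"sales"}, "kdd2": {"sales", "web"}}
tasks = {"t1": "sales", "t2": "web", "t3": "sales"}
assignment = schedule(tasks, servers)
```

Other resource allocation policies (for example, ones that weigh HPC availability) would change only the selection of `target`.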

4  Multi-Agent Based Architecture for Distributed KDD

The previous section shows how complex a DKDD environment could be. Similar
situations exist in other Distributed Artificial Intelligence (DAI) systems
[ONJ96], and in Cooperative Software Engineering (CSE) [WCL99]. They are all
examples of Computer-Supported Cooperative Work (CSCW) [Gru94, LC98]. Nowadays
it is widely agreed that Multi-Agent Systems (MAS) [MMP98] are a better
way to model these decentralized, distributed, open-ended systems and
environments. A MAS is a loosely coupled network of problem solvers (agents) that
work together to solve a given problem.

In this section we propose a MAS-based architecture for Distributed KDD,
which is inspired by our MAS-based architecture for CSE [WCL99]. In such a
Multi-Agent architecture, we have the following main components:

—  Agents: 

An agent is a piece of software created by and acting on behalf of the user
(or some other agent). It is set up to achieve a modest goal, with the
characteristics of autonomy, interaction, reactivity to environment, as well as
pro-activeness. We have the following types of agents in the DKDD architecture.
Assistant agents assist KDD people in various work, such as browsing and
sampling data, planning the KDD process, and analyzing/refining models.
Interacting agents help participants in their cooperative work, such as
communication, negotiation, coordination, and mediation. Mobile agents are
mining software that can move to DBMS sites and execute within the databases
(in order to reduce data transfer). System agents handle the administration of
the multi-agent architecture: they register and manage large numbers of
components, monitor events in and the status of the Workspaces as well as the
AgentMeetingPlaces (see below), and collect relevant measurements according to
predefined metrics.

—  AgentMeetingPlaces  (AMPs): 

AgentMeetingPlaces are where agents meet and interact with each other
(for communication, negotiation, coordination and collaboration). AMPs are
built on the underlying communication mechanisms, but must provide agents
with more intelligent means to facilitate their interaction. First of all, AMPs
should provide agent communication languages such as KQML [Fe97], defining
communication types and the common syntax of the messages transmitted. AMPs
must also provide a set of information models, by which a recipient can
understand what the message means. For negotiation agents, the purpose is to
reach an agreement. The progress of negotiation depends mainly on the
negotiation strategies employed by the negotiating agents, but AMPs should
provide mechanisms to minimize communication overheads and to help the
negotiating agents minimize their computation efforts.

—  Workspaces  (WSs): 

Just as a software engineer needs a Workspace (WS) to perform software
development tasks, a knowledge engineer needs a WS to perform mining
tasks. A WS is primarily a container (often realized as a storage area in
a file system, in a web directory, or in a sub-database) for relevant data
in a suitable format, together with the processing tools. The relevant data
include sampled source data as well as KDD process data (the KDD process
plan, etc.). The processing tools include KDD algorithms for experiments
on the sampled data, the KDD process planner, and visualization tools. In a
large organization, there may be a KDD division involving a number of
analysts and knowledge engineers. The KDD people work in groups. Each
group may have a shared Workspace while each knowledge engineer has
his/her private WS. That is, the contents of Workspaces may overlap,
with data shared by several people.

—  Repositories: 

The architecture contains various Repositories (global, local or distributed).
The most fundamental ones are the databases to mine and the model
repositories that store the results of data mining. There could be other
important repositories. For example, in a large organization with frequent KDD
activities, it is useful to maintain an Experience Base (EB). The idea is that
for each (previous) mining project, its project/operative profile is recorded
in the EB. Each new project is initiated with its own specific
project/operative profile. The project manager uses it as the criterion to search
the EB for previously completed projects that are similar to the current project.
The best-matching candidate is then selected as the baseline project. With the
baseline project as guidance, the KDD process planning (hence the quality)
of the new project could be improved in various ways.

—  High-Performance  Computers  (HPCs): 

An HPC in this context means a parallel machine dedicated to mining tasks.
This  is  necessary  when  the  size  of  data  to  be  mined  is  so  big  that  it  becomes 
unrealistic  to  perform  the  real  data  mining  tasks  in  the  Workspaces. 
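
As a rough illustration of the Experience Base lookup described under Repositories, the sketch below ranks recorded project profiles against the profile of a new project and picks the best match as the baseline. The profile representation (a set of tags), the Jaccard overlap measure, and all names and sample projects are assumptions made for illustration only; the paper does not specify how profiles are stored or compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectProfile:
    """Hypothetical project/operative profile: a name plus descriptive tags."""
    name: str
    tags: frozenset


def similarity(a: ProjectProfile, b: ProjectProfile) -> float:
    """Jaccard overlap of the two tag sets (an assumed similarity measure)."""
    union = a.tags | b.tags
    if not union:
        return 0.0
    return len(a.tags & b.tags) / len(union)


def select_baseline(experience_base, new_profile):
    """Return the recorded project most similar to new_profile, or None."""
    if not experience_base:
        return None
    return max(experience_base, key=lambda p: similarity(p, new_profile))


# Made-up sample data, not taken from the paper.
experience_base = [
    ProjectProfile("churn-study", frozenset({"telecom", "classification", "large"})),
    ProjectProfile("basket-study", frozenset({"retail", "association-rules"})),
]
new_project = ProjectProfile("new-churn", frozenset({"telecom", "classification"}))
baseline = select_baseline(experience_base, new_project)
```

A real Experience Base would likely weight profile attributes differently; the point here is only that baseline selection is a nearest-neighbour query over stored profiles.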

Within our architecture, the main components are interconnected and interoperate
as follows:

1. Set up Workspaces according to people grouping: In a large KDD process, we
can perceive various groups of people working as teams. Each KDD person
has his/her private Workspace, and the group has a shared Workspace. There
are controlling WSs where the managing, controlling and scheduling people
(or their agents) work. They are responsible for the overall cooperative planning
for the current KDD project (while other KDD people plan their own
part of the KDD tasks), for accessing global databases, and for distributing
real mining tasks to HPCs according to some resource allocation policy.

2. Create agents: Assistant agents are created by people to help them work;
interacting agents are created for communication purposes; mobile agents
are created and sent to perform data mining tasks at database sites;
and system agents are created by default to manage various components in
the architecture. Note that agent creation is a process of instantiating
the corresponding agent classes. That is, agents are created from templates,
typically by supplying a few parameters.

3.  Communication  between  agent  groups  is  via  AgentMeetingPlaces  (AMPs). 
Some  system  agents  are  created  by  default  to  manage  the  AMP  (creation, 
deletion,  and  bookkeeping). 

4. Repositories can be global, local to one person or a group of people, or
distributed. For the databases and model repositories, we note that: (1)
local databases and model repositories can be accessed from the associated
Workspaces; (2) global ones can be accessed only from the controlling WSs
where the controlling and scheduling people (or their agents) work; (3) in
the architecture, HPC components are optional, and are placed close to the
relevant global database components; (4) mobile agents can travel to global
databases and execute there.

5.  Within  a  Workspace,  any  existing  KDD  process  models  are  allowed.  For 
example,  the  planning  and  replanning  techniques  described  in  [ZLK097, 
LZ097]  can  be  applied  to  them. 
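
Steps 2 and 3 above describe agents as instances of agent classes that talk to each other through a meeting place. A minimal sketch of that idea follows. The class names, fields, and the message layout are hypothetical; the message string only imitates the general shape of a KQML performative (a performative name followed by keyword fields) and is not a complete or validated KQML message.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Base template; concrete agents are instances created with a few parameters."""
    name: str
    owner: str


@dataclass
class InteractingAgent(Agent):
    """Hypothetical template for agents that communicate via an AgentMeetingPlace."""
    language: str = "KQML"

    def compose(self, performative, receiver, content):
        """Format a KQML-style message string with keyword fields."""
        return (f"({performative} :sender {self.name} :receiver {receiver} "
                f":content {content})")


@dataclass
class MobileAgent(Agent):
    """Hypothetical template for agents sent to a database site to mine there."""
    destination: str = "global-db"


# Agent creation is instantiation of the corresponding class.
talker = InteractingAgent(name="ws1-talker", owner="analyst-a")
miner = MobileAgent(name="ws1-miner", owner="analyst-a", destination="sales-db")
message = talker.compose("ask-one", "controller", "(status task-7)")
```

Keeping the templates as plain classes means a new kind of agent only requires a new subclass with its own default parameters.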

Figure 2 shows some key issues in the architecture concerning the interconnection
and interoperation among the five components. To keep the figure from becoming
too complicated, we do not show every detail of the architecture described
above. The features shown in the figure include: Workspaces and their local
KDD processes and local databases; the controlling WS and its access to the global
database; mobile agents traveling to the global database; the close proximity
of the global database and the HPC; and interacting agents communicating via
an AgentMeetingPlace. When a normal WS completes a KDD plan and wants
to perform real mining tasks on the global database, it communicates with the
controlling WS through interacting agents, and the latter schedules the tasks,
accesses the global database, caches relevant data, and invokes the nearby HPC to
perform the mining tasks. The results are passed back to the normal WS, also via
interacting agents. The normal WS can also send a mobile agent to the global
database and let it execute there.
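
The request/reply flow just described (normal WS to controlling WS to HPC and back) can be sketched as below. This is only an illustration under strong simplifying assumptions: the AgentMeetingPlace is modelled as in-memory mailboxes, the "mining task" is a trivial count, and every name and data value is hypothetical.

```python
from collections import deque


class AgentMeetingPlace:
    """Hypothetical AMP: one mailbox per recipient, filled by interacting agents."""

    def __init__(self):
        self._mailboxes = {}

    def send(self, recipient, message):
        self._mailboxes.setdefault(recipient, deque()).append(message)

    def receive(self, recipient):
        box = self._mailboxes.get(recipient)
        return box.popleft() if box else None


def hpc_mine(data, task):
    """Stand-in for the real mining run on the HPC: count records containing `task`."""
    return sum(1 for record in data if task in record)


def controlling_ws_step(amp, global_db):
    """Controlling WS: take one request, cache the data, invoke the HPC, reply."""
    request = amp.receive("controlling-ws")
    if request is None:
        return False
    cached = list(global_db[request["table"]])  # cache relevant data near the HPC
    result = hpc_mine(cached, request["task"])
    amp.send(request["reply_to"], {"task": request["task"], "result": result})
    return True


amp = AgentMeetingPlace()
global_db = {"sales": ["milk bread", "bread butter", "milk"]}  # made-up records
amp.send("controlling-ws", {"table": "sales", "task": "milk", "reply_to": "ws-1"})
controlling_ws_step(amp, global_db)
reply = amp.receive("ws-1")
```

The normal WS never touches the global database directly here, which mirrors the access rule that global repositories are reachable only from controlling WSs.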

Compared  with  the  “traditional”  Client/Server  architecture,  we  can  mention 
the  main  advantages  of  MAS-based  architecture,  when  applied  to  Distributed 
KDD,  as  follows: 


- Decentralization: being able to break down a complex system into a set of
decentralized, cooperative subsystems. Here we may have distributed databases
and unbounded numbers of agents, Workspaces and AgentMeetingPlaces.

- Reuse of previous components/subsystems: that is, building a new and possibly
larger system by interconnection and interoperation of existing (sub)systems,
even though they are highly heterogeneous.

- Cooperative Work Support: being able to better model and support the spectrum
of interactions (communication, negotiation, coordination and collaboration)
in cooperative work.

-  Flexibility:  being  able  to  cope  with  the  characteristic  features  of  a  distributed 
environment  such  as  DKDD,  namely  incomplete  specification,  evolution,  and 
open-endedness. 

-  Simplicity:  being  able  to  offer  conceptual  clarity  and  simplicity  in  modeling 
and  design. 


5  Conclusions 

This paper investigated the architectural aspects of Distributed KDD systems,
viewing DKDD as a special case of CSCW. We listed the requirements needed
to support Distributed KDD, described a Client/Server architecture for DKDD
that is “traditional” for CSCW in general, and proposed a Multi-Agent System (MAS)
based architecture for DKDD. Compared with the traditional Client/Server
architecture, the MAS based architecture is better in terms of simplicity and
flexibility, and particularly useful in modeling and providing support to cooperative
activities (communication, negotiation, coordination and collaboration).

In terms of implementation, both architectures rely on Internet-related
technologies such as software component technology and various Java/CORBA
packages. For the MAS based architecture, we also need communicating agents and
mobile agents. Here, the rapid development of agent technology provides
many options for implementation, such as KQML [Fe97] for agent communication
and Aglets [L098] for mobile agents. Prototyping work based on these
techniques is under way.
Abstract. We present a software architecture model of adaptation in
CBR. A software architecture is defined by its components and their
connectors. We present a software architecture for CBR systems based
on three components (a task description, a domain model, and adaptors)
connected by a type of connector called bridges. Adaptors are basic inference
components that perform specific transformations to cases. Two
kinds of adaptors are introduced: domain adaptors (d-adaptors) and case-based
adaptors (c-adaptors). Adaptors are applied to a given problem,
performing search until a sequence of adaptor instantiations is found
such that a solution is achieved. Thus, in the ABC architecture adaptation
is viewed as a search process on the space of adaptors. The ABC
components have been used in the SaxEx application, a CBR system for
generating expressive musical phrases.


1  Introduction 

The goal of software architectures is to learn from system development experience
in order to provide the abstract recurring patterns for improving further system
development. As such, the contribution of software architectures is mainly
methodological, in providing a way to specify systems. In this paper we present a
software architecture for adaptation in CBR, called ABC (for “Adaptors and Bridges
as Connectors”), based on the notion of connectors and inspired by object-oriented
and component-based methodologies.

The  three  main  elements  of  the  ABC  software  architecture  are  (i)  a  task 
description — characterizing  the  goal  that  a  CBR  system  pursues;  (ii)  a  domain 
model — characterizing  the  ontology  and  properties  of  the  knowledge  content;  and 
(iii)  a  library  of  adaptors — performing  transformations  to  case-specific  models. 
The connectors linking these three elements are called bridges.

* This research has been supported by the Project IST-1999-19005 IBROW, An
Intelligent Brokering Service for Knowledge-Component Reuse on the World-Wide Web,
and the CICYT Project SMASH: Systems of Multiagents for Medical Services in
Hospitals.

Z.W.  Ras  and  S.  Ohsuga  (Eds.):  ISMIS  2000,  LNAI 1932,  pp.  601-609,  2000. 

©  Springer-Verlag  Berlin  Heidelberg  2000 


602  E.  Plaza  and  J.-L.  Arcos 


ABC follows the “problem solving as modeling” view, i.e. solving a problem
consists of building a model specific to the problem that satisfies the task
requirements; we call it the case-specific model. In this view, a knowledge system
uses a domain model to enlarge the input model until a complete and correct
case-specific model is built, where “complete and correct” are with respect to
the requirements of the task.

We have considered two kinds of adaptors: domain adaptors (d-adaptors) and
case-based adaptors (c-adaptors). D-adaptors use some domain-specific knowledge
to transform the case-specific model (in a way specified by the adaptor’s
competence). C-adaptors also transform the case-specific model but use domain
knowledge that includes precedent cases retrieved from case memory.

Adaptors are applied to the case-specific model, performing search until a
sequence of adaptor instantiations is found that transforms the initial
case-specific model into a correct case-specific model that satisfies the task goals.
Thus, adaptation is viewed as a search process on the space of adaptors. Since
new adaptors can be applied to the first adapted object, several search strategies
(such as depth-first, breadth-first, and beam search) are possible.

We are applying the ABC theory in the SaxEx application [1], a complex
real-world case-based reasoning system for generating expressive performances of
melodies based on examples of human performances that are represented as
structured cases.

In general terms, a software architecture describes (i) the components, (ii) the
connectors, and (iii) a configuration of how the components should be connected
[7]. We can consider CBR systems as a specific variant of knowledge systems
that furthermore use experiential knowledge [3]. Because of this we have taken
UPML, a software architecture being developed for reuse of knowledge systems,
and we are developing a variant adequate for CBR systems. The Unified Problem
Solving Method Development Language (UPML) is currently in development by
the IBROW consortium, and the first version has been released [6]. Although
UPML may still undergo modifications, we expect them to be minor and
to leave the core ideas stable.

The  organization  of  this  paper  is  as  follows.  In  Section  2  we  present  the 
ABC  architecture.  Section  3  describes  how  two  different  families  of  adaptors 
(c-adaptors  and  d-adaptors)  are  incorporated  in  the  Noos  language.  Finally,  in 
Section  4  we  present  the  conclusions  and  discuss  related  work. 


2  The  ABC  architecture 

The three main elements of the ABC software architecture are (i) a task
description, (ii) a domain model, and (iii) a library of adaptors. Figure 1 shows
these three elements connected with a special kind of connector called a bridge. In
addition, the problem to be solved is called input in the figure, and for simplicity
we will include the case base in the domain model element. More specifically,
we will consider that each solved problem is a model per se, and we will call it a
case-specific model; in other words, it is the model of an episode of solving that
problem [3].

Fig. 1. The ABC software architecture consists of three elements: a task description,
a domain model, and a library of adaptors. These three elements are connected by
connectors called bridges. The problem to be solved is called input in this picture.

These three elements are taken from UPML, where the main goal is the
reuse of Problem Solving Methods (PSMs); since our goal is the reuse of cases,
we propose the specific architectural variation where adaptors play the role of
PSMs. This transformation makes sense since PSMs are the components that
perform the inferences for a knowledge system to build the case-specific model of
the problem (i.e. the “solution” to the problem). In our approach the final
case-specific model is built by the adaptors that transform the case-specific model
imported from the case(s) retrieved from the case base.

Tasks, domain models, and adaptors are conceptually distinct entities, although
in practice CBR systems use an implicit description of the task, and
domain knowledge is tightly integrated with the CBR engine. From a methodological
stance, however, it is better to consider them separate and possibly coming
from distinct sources.

In a similar manner, specification of tasks is also being studied [11] to provide
a vocabulary capable of describing tasks across a range of different domains of
application, i.e. independent of the domain-specific vocabulary. For instance, a
task description of diagnosis [6] is specified in terms of findings and hypotheses.
A bridge is then needed to connect task descriptions and domain models in a
meaningful way. For instance, in a medical diagnosis domain the bridge from the
task description to the domain model maps findings and hypotheses to the terms
manifestation and cause respectively.
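
One simple way to picture such a bridge is as a term-to-term mapping applied to task-level expressions. The sketch below assumes expressions are nested tuples of strings and uses the finding/hypothesis mapping from the medical example; the function name, the expression format, and the sample goal are illustrative assumptions, not part of UPML or ABC.

```python
def apply_bridge(bridge, expression):
    """Rewrite task-level terms in a nested tuple expression into domain terms."""
    if isinstance(expression, tuple):
        return tuple(apply_bridge(bridge, part) for part in expression)
    return bridge.get(expression, expression)


# Mapping taken from the medical diagnosis example in the text.
td_bridge = {"finding": "manifestation", "hypothesis": "cause"}

# Hypothetical task-level goal: a hypothesis explains a finding.
task_goal = ("explains", "hypothesis", "finding")
domain_goal = apply_bridge(td_bridge, task_goal)
```

Terms that the bridge does not mention pass through unchanged, so the same task description can be reused with a different mapping for another domain.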

Therefore the methodological approach we are endorsing takes two main aspects
of knowledge modeling techniques: explicit representation and conceptual
separation of tasks, domain knowledge, and adaptors. From the UPML software
architecture [6] we adopt the bridge connectors among ABC architecture components,
but we change the main elements of the architecture for CBR systems. In
the rest of this section we make explicit those components, while in later sections
we show a particular adaptation engine developed following this methodology.


2.1  The  ABC  components 

Tasks A task provides a way to characterize what a CBR system is intended to
achieve.

A Task Description consists of a task name and

1.  pragmatics  (author,  explanation,  URL,  last  change  date) 

2.  ontology  (the  vocabulary) 

3.  specification 

—  goals  (expressions  characterizing  the  output  case-models) 

—  preconditions  (expressions  characterizing  valid  input  case-models) 

—  assumptions  (expressions  characterizing  requirements  on  domain  knowledge) 

The main elements for characterizing a task are goals, preconditions and
assumptions. These elements are described in some logical language, the choice
of which is open to the designer. Preconditions state constraints to be satisfied
by the problems to be solved (input case models). Goals specify properties to
be satisfied by the solved problem, i.e. by the output case-specific model.
Finally, assumptions state what the task description requires of
the content of the domain model.

Domain Models Domain models are specified using a specific vocabulary
(domain ontology) and are characterized by properties, assumptions, and domain
knowledge.

A Domain Model Description consists of a domain model name and

1.  pragmatics  (author,  explanation,  URL,  last  change  date) 

2.  ontology  (the  vocabulary) 

3.  specification 

—  properties  (meta-expressions  characterizing  domain  knowledge) 

—  domain  knowledge  (expressions  describing  knowledge) 

—  assumptions  (expressions  characterizing  assumptions  on  domain  model) 

Properties and assumptions are both used to characterize the knowledge
content of a KB. These characteristics can be directly inferred from the domain
knowledge or can be derived from requirements introduced by other components
of the specification. While properties deal with characteristics of the knowledge
content, assumptions deal with external requirements like the environment of the
system. As before, properties, assumptions, and domain knowledge are expressed
in a specific formal language of choice.

Task-domain bridge The td-bridge is a connector that translates (refines)
the task specification to a particular domain specified in the domain model. This
bridge may add assumptions (on domain knowledge) to ensure that the translation
result is valid. The only formal requirement is that the union of the task and
domain specifications is logically consistent.

Adaptors An adaptor is a special kind of connector between case-specific
models, i.e. between “models of cases”.

Fig. 2. The adaptor is a connector between “models of cases”, here called
case-specific models.


An Adaptor Description consists of an adaptor name and

1.  pragmatics  (author,  explanation,  URL,  last  change  date) 

2.  ontology  (the  vocabulary) 

3.  specification 

-  preconditions  (expressions  characterizing  valid  input  case-models) 

-  assumptions  (expressions  characterizing  domain  knowledge  needed  by  the 
adaptor  to  succeed) 

-  competence  (expressions  characterizing  the  output  case-models) 

The preconditions of an adaptor specify the requirements to be satisfied by
the input case-specific model for the adaptor’s result to be valid. The competence
is a description of the transformation resulting from the application of
the adaptor. Finally, the assumptions express the kind of domain knowledge the
adaptor requires in order to be able to function. These assumptions may enlarge
the requirements on domain knowledge already specified by the task. Since
the case base is considered as a specific type of domain knowledge, case-based
adaptation is considered to be realized by adaptors that use the experiential
knowledge of the case base. Case-based adaptors are discussed later in §3.
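
The adaptor description above can be read as a small record with three specification slots. The sketch below is one possible encoding, assuming that preconditions and competence are plain Python predicates over a dictionary-shaped case model and that assumptions are names of required knowledge items. None of these representation choices, nor the sample adaptor, come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet

CaseModel = Dict[str, object]  # assumed flat stand-in for a case-specific model


@dataclass
class AdaptorDescription:
    name: str
    preconditions: Callable[[CaseModel], bool]  # is the input model valid?
    competence: Callable[[CaseModel], bool]     # does the output look as promised?
    assumptions: FrozenSet[str] = frozenset()   # domain knowledge the adaptor needs
    pragmatics: Dict[str, str] = field(default_factory=dict)

    def applicable(self, model: CaseModel, domain_knowledge: FrozenSet[str]) -> bool:
        """Usable only on valid input and only if the assumed knowledge is available."""
        return self.preconditions(model) and self.assumptions <= domain_knowledge


# Hypothetical adaptor over a made-up "tempo" feature.
slow_down = AdaptorDescription(
    name="slow-down",
    preconditions=lambda m: m.get("tempo", 0) > 60,
    competence=lambda m: m.get("tempo", 0) <= 60,
    assumptions=frozenset({"tempo-rules"}),
    pragmatics={"author": "example"},
)
model = {"tempo": 100}
can_run = slow_down.applicable(model, frozenset({"tempo-rules", "cases"}))
blocked = slow_down.applicable(model, frozenset())
```

Checking assumptions against the available domain knowledge is the role the text assigns to the da-bridge; here it is reduced to a subset test.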

Task-Adaptor bridge The ta-bridge works like the td-bridge above, but
now it connects the task goals with the adaptors’ competence. Since the task
goals specify the conditions for a problem to be correctly and completely solved,
the problem solving process is finished when an adaptor with a corresponding
competence is available.

There are different ways to realize the ta-bridge depending on the strategy
used to implement adaptors. A common strategy is designing a component
library of adaptors. Moreover, depending on the complexity of the application
domain, the designers may implement one-shot adaptors, i.e. adaptors with a
competence that directly fulfills the task goals. In more complex situations, the
“total” adaptor needs to be constructed from the elementary components in the
adaptor library to fit the needs of each particular problem. This is the
implementation we have chosen for the SaxEx system [9]. In this setting, adaptation
is then a search problem over the space of adaptors whose goal is finding a combination
of adaptor instantiations such that the final competence satisfies, via the bridge,
the task goals.

From the software architecture stance, what is formally required to establish
a ta-bridge is only that the adaptor competence logically implies the task
goals. The ABC architecture neither establishes control constraints on the
implementation nor distinguishes the situations where the adaptors already exist
from those where they have to be constructed from elementary adaptor components
(and whether this construction is automated or performed by hand).
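
The formal requirement on the ta-bridge (competence implies task goals) can be illustrated in a deliberately restricted setting. The sketch assumes that both competence and goals are conjunctions of atomic propositions, so that implication reduces to set inclusion after the bridge has translated the goal vocabulary. The atoms and the bridge mapping are invented for the example; a real specification language would need a proper entailment check.

```python
def ta_bridge_holds(competence, task_goals, bridge):
    """True if the translated task goals are all guaranteed by the competence.

    Assumes conjunctions of atoms: competence implies goals iff goals <= competence.
    """
    translated = {bridge.get(goal, goal) for goal in task_goals}
    return translated <= set(competence)


# Hypothetical vocabulary mapping and atoms.
bridge = {"solved": "plan-complete", "explained": "trace-recorded"}
goals = {"solved", "explained"}

full = ta_bridge_holds({"plan-complete", "trace-recorded", "cost-known"}, goals, bridge)
partial = ta_bridge_holds({"plan-complete"}, goals, bridge)
```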

Domain-Adaptor bridge The da-bridge connects the assumptions upon
domain knowledge specified by the adaptor with the domain model, in a way
similar to how the td-bridge maps task assumptions to the domain model. Some
requirements of the adaptor (which plays the role of the PSM) upon domain models
have already been established by connecting the adaptor with the task and the
task with the domain. Now we only need to map
the knowledge requirements that are exclusive to the adaptor.

2.2  ABC  and  CBR  systems  design 

The very idea of software architectures is to learn from system development
experience in order to provide the abstract recurring patterns for improving
further system development. As such, the contribution of software architectures is
mainly methodological, in providing a way to specify systems. If we consider existing
CBR systems and the ABC architecture, we can observe that ABC makes
explicit issues that CBR system developers already know but treat implicitly
when developing new systems, and that are not explicit in the actual
CBR system either. Take the task of a CBR system, for instance. The specific
task of a CBR system always has to be specified, albeit informally, in the system
design phase. The ABC approach considers this a specification of the task but
also provides a specific way to relate that specification to each other component
of the architecture: preconditions relate to the input problem, assumptions
relate to the availability of knowledge, and goals relate to the search process
performed by the CBR system.

Furthermore, let us consider domain knowledge. Some CBR systems use cases
(and similarity) as the only source of knowledge available to solve problems
(e.g. instance-based learning approaches). However, a great number of CBR
systems use domain knowledge, for different purposes and in different ways, in
addition to cases. Commonly, this domain knowledge is not described as such,
but only through an explanation of the implementation of the CBR system. In
other words, what is described is the representation used to encode it (rules,
constraints) and the role it plays in the system implementation (mainly
concerning control issues). In our opinion, a clarification of the role
of domain knowledge in CBR systems is needed to improve the understanding
of CBR and the development of CBR systems.

As a result of focusing on the adaptation process, ABC suggests that
retrieval (and similarity assessment) is also a type of domain knowledge. In our
approach, solving a problem is constructing a case-specific model of the “input
problem”, as was established in the knowledge-level description of CBR [3]. A
software architecture is a much more refined level of description, so solving a problem
in ABC involves building a case-specific model that satisfies the task description
goals. Domain knowledge is used to perform the inference necessary to build this
model (footnote 1). The ABC architecture does not deal with control aspects of the
implementation; thus the order in which domain-specific inference and case retrieval
are performed is unspecified.


3  Implementing  Adaptors 

We will consider two kinds of adaptors: domain adaptors (d-adaptors) and
case-based adaptors (c-adaptors). “Transformational adaptation” is realized by
d-adaptors, i.e. by adaptors that use some domain-specific knowledge to transform
the case-specific model (in a way specified by the adaptor’s competence).
Moreover, that domain knowledge is the one explicitly required by the adaptor’s
assumptions.

“Derivational replay” is realized by c-adaptors. Case-based adaptors also
transform the case-specific model but use some domain knowledge that includes
a precedent case retrieved from case memory (footnote 2). In the simplest scenario
there is only one retrieved case, but in CBR systems where parts of cases are also
cases, each part can be adapted in a case-based way by c-adaptors. Derivational
replay in planning is one example and the SaxEx system [9] is another:
adaptation in the SaxEx system combines d-adaptors and c-adaptors.

The main issue in going from a specification like ABC to an actual
implementation is deciding 1) how components and bridges are represented, and
2) the control scheme. We are implementing adaptors in Noos, a representation
language designed for supporting knowledge modeling approaches to problem
solving and learning [2] in which different CBR systems have been built,
including SaxEx. In Noos, cases are represented as feature terms [8], a formalism for
representing structured cases in which any subpart of a case (feature term) is
also a term, and thus is also a case. Inference is provided by problem solving
methods (PSMs) that use domain knowledge to build models (or parts of
models). A problem is solved when a case-specific model is completed, and then it is
retained in the case base. Retrieval is performed by specialized PSMs, retrieval
methods, that use domain knowledge or heuristic principles to search the case
base. Concerning the control scheme, Noos inference is on demand, i.e. it follows a
lazy evaluation strategy. The chain of control is thus backwards: retrieval
methods determine the features of a case that they need, thus forcing the evaluation of
the PSMs that infer those needed features that were not part of the input
problem model. Moreover, c-adaptors use retrieval methods, so the retrieval process
is in fact directed by the adaptation strategy.

1 Some CBR papers distinguish between primary and derived features of cases. Primary
features are those appearing in the “input case” and derived features are inferred by
the system from primary features. In our approach, inference uses domain knowledge
(including cases) to build a model of the problem.

2 Recall that, for the ABC architecture, the base of cases is also part and parcel of the
domain model.
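
The on-demand control scheme described above can be imitated in a few lines: a feature is inferred only when a retrieval method asks for it. This is an analogy in Python, not Noos code; the class, the `retrieve` function, and the sample features are all hypothetical.

```python
class LazyCaseModel:
    """Case-specific model whose missing features are inferred only when requested."""

    def __init__(self, given, psms):
        self._features = dict(given)  # features supplied with the input problem
        self._psms = psms             # feature name -> function(model) -> value
        self.evaluated = []           # which PSMs were actually run, in order

    def get(self, feature):
        if feature not in self._features:
            psm = self._psms[feature]  # KeyError if nothing can infer the feature
            self.evaluated.append(feature)
            self._features[feature] = psm(self)
        return self._features[feature]


def retrieve(case_base, problem, needed):
    """Hypothetical retrieval method: it decides which problem features it needs."""
    return [c for c in case_base if all(c.get(f) == problem.get(f) for f in needed)]


psms = {
    "interval": lambda m: m.get("high") - m.get("low"),
    "unused": lambda m: 1 / 0,  # never requested, so never evaluated
}
problem = LazyCaseModel({"low": 60, "high": 67}, psms)
case_base = [{"interval": 7}, {"interval": 5}]
matches = retrieve(case_base, problem, ["interval"])
```

Because the retrieval method names the features it needs, the PSM for `interval` runs once and the PSM for `unused` never runs, which is the backward chain of control the text describes.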

The main ABC elements incorporated in Noos are i) an explicit description
of a task, ii) adaptors, and iii) ta-bridges. Since the rest of the ABC elements
are omitted, some parts of these elements need not be represented explicitly,
the reason being that Noos will not be reasoning about them. Thus, a task holds
only goals and preconditions, while an adaptor holds only competence and
preconditions. Assumptions are not present since we are not representing td-bridges
or da-bridges. The contents of these slots (goals, competence, preconditions)
are expressed by feature terms. Satisfaction is represented as feature term
subsumption ($\sqsubseteq$); thus a case-specific model $C$ satisfies an adaptor’s
preconditions $A_P$ when $A_P \sqsubseteq C$ ($A_P$ subsumes $C$).
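
To make the subsumption test concrete, the sketch below treats feature terms as nested dictionaries and says that a general term subsumes a specific one when every feature constrained in the general term is present, with a subsumed value, in the specific term. This is a strong simplification assumed for illustration: real feature terms also involve sorts and shared variables, which are ignored here, and the sample features are invented.

```python
ANY = None  # assumed marker for an unconstrained feature value


def subsumes(general, specific):
    """True if every constraint in `general` is also satisfied by `specific`."""
    if general is ANY:
        return True
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return False
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


# Hypothetical adaptor preconditions and case-specific model.
preconditions = {"note": {"duration": ANY}, "mode": "minor"}
case_model = {"note": {"duration": 2, "pitch": "C4"}, "mode": "minor", "tempo": 90}

satisfied = subsumes(preconditions, case_model)
reverse = subsumes(case_model, preconditions)
```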

The overall adaptation process is realized following an “Adaptation as Search”
strategy. The initial state is the case-specific model of the problem; this begins
with the information given as input, but the domain PSMs can enlarge this
model performing inference as needed. The goal state is a complete and correct
case-specific model $C_f$ that satisfies the task goals $T_G$. The ta-bridge provides
a translation from the task description vocabulary to the domain vocabulary
used in adaptors and case-specific models. Thus, the task goals expressed in
domain vocabulary are obtained by applying the bridge $B_{ta}$ to the task goals,
giving $B_{ta}(T_G)$, and therefore a solution is defined as a case-specific model $C_f$
such that $B_{ta}(T_G) \sqsubseteq C_f$ (that is, $B_{ta}(T_G)$ subsumes $C_f$).

Adaptors  are  applied  to  the  case-specific  model,  performing  search  until  a 
sequence  of  adaptor  instantiations  is  found  such  that  transforms  the  initial  case- 
specific  model  into  Cf-  A  classical  means  ends  analysis  technique  is  used  with 
the  adaptors,  where  preconditions  establish  if  the  adaptor  is  applicable  to  a 
particular  case-specific  model,  and  competence  establishes  the  goals  or  subgoals 
achievable  by  instantiating  the  adaptor.  Since  Noos  provides  automatic  back¬ 
tracking,  selection  of  adaptors  and  adaptor  instantiation  following  several  search 
strategies  — such  as  depth-first,  breadth-first,  and  beam  search—  can  be  easily 
implemented  for  a  particular  CBR  system.  An  interesting  issue  left  for  future 
work  is  performing  a  case-based  search  of  adaptor  selection  and  instantiation: 
since  adaptors  are  feature  terms,  they  are  stored  in  memory  by  Noos  and  they 
are thus amenable to retrieval. This case-based adaptation process would
be  able  to  use  both  c-adaptors  and  d-adaptors,  unifying  “transformational”  and 
“generative”  adaptation  in  a  case-based  reuse  of  cases. 
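
The "Adaptation as Search" strategy described above can be illustrated with a small sketch. It is only an approximation under stated assumptions: feature terms are modelled as flat sets of features, subsumption as the subset relation, and the search is plain breadth-first forward search rather than the means-ends analysis with Noos backtracking used by the authors. The adaptor names and features are hypothetical.

```python
from collections import deque


def subsumes(general, specific):
    """Toy subsumption: a term is a frozenset of features, and the more
    general term subsumes the more specific one when it is a subset."""
    return general <= specific


# Hypothetical adaptors: (name, preconditions, features added to the model).
ADAPTORS = [
    ("add-sauce", frozenset({"pasta"}), frozenset({"sauce"})),
    ("add-cheese", frozenset({"pasta", "sauce"}), frozenset({"cheese"})),
]


def adapt(initial_model, goals, adaptors=ADAPTORS):
    """Breadth-first search for a sequence of adaptors leading to a model
    that the goals subsume. Returns the adaptor names, or None."""
    start = frozenset(initial_model)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        model, path = frontier.popleft()
        if subsumes(goals, model):        # goal test: goals subsume the model
            return path
        for name, pre, added in adaptors:
            if subsumes(pre, model):      # adaptor is applicable to the model
                nxt = model | added
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [name]))
    return None


plan = adapt({"pasta"}, frozenset({"cheese"}))
```

Swapping the queue for a stack would give depth-first search, which is the kind of strategy variation the text says is easy to implement.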

4  Discussion  and  Related  work 

A  conceptual  framework  for  describing  CBR  systems  is  Richter’s  knowledge  con¬ 
tainers  [10].  An  approach  towards  a  formal  model  of  transformational  adapta¬ 
tion  based  on  the  knowledge  containers  framework  is  presented  in  [4].  The  pur¬ 
pose  of  Bergmann  and  Wilke’s  paper  is  to  characterize  when  properties  such  as 
soundness  and  completeness  can  be  formally  proven  to  hold  in  transformational 
adaptation. Interestingly, their approach centered on adaptation also seems to
downplay  the  importance  of  retrieval  (and  similarity)  in  CBR  systems,  in  a 
similar  way  as  the  ABC  architecture  conceives  of  retrieval  as  a  part  of  domain 
knowledge.  In  our  approach  it  is  up  to  the  designer  of  a  CBR  system  to  decide 
whether  completeness  is  required  or  possible.  Moreover,  the  designer  may  decide 
to use a logical language for specifying an ABC architecture and then formally
prove  that  certain  formal  properties  hold.  For  an  approach  of  using  UPML  with 
automated  reuse  see  [5].  It  is  an  interesting  question  whether  the  knowledge 
containers  framework  could  be  refined  to  provide  a  software  architecture  with 
the  containers  as  components — in  which  case  appropriate  connectors  should  be 
defined. 
Abstract. Recent successful applications of AI planning technology have highlighted
the knowledge engineering of planning domain models as an important
research  area.  We  describe  an  implemented  translation  algorithm  between  two 
languages used in planning representation: PDDL, a language used for communication
of example domains between research groups, and OCLh, a language
developed specifically for planning domain modelling. The algorithm is being
used as part of OCLh's tool support to import models expressed in PDDL to
OCLh's environment. Here we outline the translation algorithm, and discuss the
issues that it uncovers. Although the tool performs reasonably well when its output
is measured against hand-crafted OCLh, it results in only partially specified
models. Analysis of the translation results shows that this is because many natural
assumptions  about  domains  are  not  captured  in  the  PDDL  encodings. 


1  Introduction 

Despite  many  years  of  research  into  AI  Planning,  knowledge  engineering  for  applica¬ 
tions  of  planning  technology  is  in  its  infancy.  Recent  successful  AI  planning  applica¬ 
tions  [10, 14, 1]  have  nonetheless  highlighted  the  problems  facing  knowledge  engineer¬ 
ing  in  planning.  Questions  include  how  to  choose  appropriate  planner  technology  for  a 
given  application,  and  how  to  encode  knowledge  into  domain  models  for  use  with  plan¬ 
ning  algorithms.  The  engineering  of  knowledge-based  planners  has  resulted  in  a  set  of 
workshops and initiatives, including those in references [2, 12]. An accepted syntax for
exchange  of  domain  encodings  is  PDDL,  a  planning  domain  definition  language,  and 
many  established  planners  can  be  obtained  via  the  internet  with  a  set  of  domains  en¬ 
coded  in  this  syntax.  PDDL  emerged  from  the  need  to  construct  a  common  language 
for the biennial AIPS competitions (for details of PDDL and domain examples consult
reference  [3]).  Language  conventions  such  as  PDDL  help  the  research  community  to 
some  extent  in  the  problems  of  exchanging  research  information,  and  in  the  indepen¬ 
dent  validation  of  research  results. 

Domain  definition  languages  such  as  PDDL,  however,  are  not  designed  with  the 
same  criteria  in  mind  as  a  domain  modelling  language.  The  latter  would  be  associated 
with  a  domain  building  methodology,  be  structured  to  allow  the  expeditious  capture 
of  knowledge,  and  have  the  benefit  of  a  tools  environment  for  knowledge  engineering. 
OCLh  stems  from  a  family  of  fairly  simple  planning-oriented  domain  modelling  lan¬ 
guages  deriving  from  the  work  in  reference  [8].  The  benefit  in  using  OCLh  is  seen  as 
twofold: to improve the planning knowledge acquisition and validation process; and to
improve  and  clarify  the  plan  generation  process  in  planning  systems.  A  range  of  plan¬ 
ners  have  been  implemented  for  use  with  OCLh  and  the  language  is  being  used  as  a 
prototype  for  a  collaborative  UK  project  to  create  a  knowledge  engineering  platform 
for  planning  [13].  OCLh  is  structured  to  allow  the  capture  of  object  and  state-centred 
knowledge,  as  well  as  action-centred  knowledge, 1  and  it  is  encased  in  a  tools  environ¬ 
ment. 

In  this  paper  we  discuss  the  issues  raised  in  the  construction  of  one  of  the  tools 
in OCLh's environment: a translator, to help import models written in PDDL into the
OCLh  environment.  The  translation  is  feasible  because  PDDL  and  OCLh  share  similar 
underlying  assumptions  about  worlds  -  they  are  assumed  closed,  actions  are  determin¬ 
istic  and  instantaneous.  We  outline  the  translation  process,  and  the  results  in  applying 
it  to  example  domain  models.  The  tool’s  output  is  used  to  both  make  comparisons  with 
hand-crafted  models,  and  to  identify  omissions  and  insecurities  in  the  PDDL  encodings. 
In  effect,  it  is  not  possible  for  the  tool  to  derive  secure  OCL  encodings  as  the  notion  of 
“valid state” is present neither implicitly nor explicitly in the PDDL encodings. This has
serious consequences for the communication and maintenance of domain descriptions
within this medium, as the set of states in which an action can be executed will generally
contain  many  states  which  are  not  sensible.  For  example,  while  it  is  understood  that  do¬ 
main  writers  encode  all  the  positive  preconditions  that  must  be  true  before  an  action  is 
executed,  the  language  does  not  encourage  the  recording  of  propositions  that  must  not 
be  true  for  the  execution  to  make  sense. 

2  The  Planning  Domain  Definition  Language 

PDDL  was  established  by  the  AIPS-98  Competition  Committee  to  enable  competitors 
to  have  a  common  language  for  defining  domains,  and  to  aid  the  development  of  a 
set  of  problems  written  in  PDDL  on  which  the  different  planners  could  be  tested  [3]. 
PDDL  has  been  incrementally  extended  to  include  a  wide  range  of  syntactic  features, 
although  most  planners  cited  in  the  literature  utilise  only  basic  features.  Planners  can 
be  restricted  to  a  PDDL  subset  by  declaring  those  language  features  required  when  the 
domain  is  defined.  Here  we  only  mention  those  features  relevant  to  the  paper. 

PDDL’s  basic  level  of  representation  is  the  literal,  and  a  model’s  central  element  is  a 
set  of  operator  schemae  representing  generalised  domain  actions  (very  much  in  the  style 
of  ‘classical  planning’  literature  with  its  roots  in  STRIPS  [4]).  Each  operator  is  defined 
with  a  precondition  and  effect,  where  the  semantics  are  interpreted  under  the  STRIPS 
assumptions.  Below  are  two  examples  of  simple  PDDL  operator  definitions  which  use 
typed  parameters.  They  belong  to  an  encoding  of  an  example  domain  called  the  Tyre 
World  which  was  taken  from  the  distribution  examples  associated  with  reference  [3].  A 
planner  using  the  Tyre  World  should  be  able  to  output  sequences  of  ground  operators 
to solve goals involving changing a flat tyre. We will use this domain as a “running
example”.

1  traditionally,  models  of  planning  domains  were  equated  with  a  set  of  action  specifications,  and 
were  therefore  only  ‘action-centred’ 
(:action loosen
  :parameters (?n - nut ?h - hub)
  :precondition (and (have wrench) (tight ?n ?h) (on-ground ?h))
  :effect (and (loose ?n ?h) (not (tight ?n ?h))))

(:action fetch
  :parameters (?x - (either tool wheel) ?c - container)
  :precondition (and (in ?x ?c) (open ?c))
  :effect (and (have ?x) (not (in ?x ?c))))

The  loosen  operator  models  the  action  of  undoing  (but  not  removing)  the  nuts  that 
fasten  a  wheel  onto  a  hub.  The  fetch  operator  models  the  action  of  removing  a  tool  or  a 
wheel  from  a  container  in  which  it  was  stored  (such  as  a  car’s  trunk). 
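
To make the STRIPS reading of these operators concrete, the following minimal sketch applies a ground instance of loosen to a state. It assumes a state is a set of ground literals and that an effect splits into an add set and a delete set; the object names nut1 and hub1 and the function name are illustrative and are not taken from the Tyre World distribution.

```python
def apply_strips(state, pre, add, delete):
    """Apply a ground STRIPS operator to a state (a frozenset of ground literals)."""
    if not pre <= state:
        raise ValueError("precondition not satisfied")
    # Literals not mentioned in the effect persist unchanged (STRIPS assumption).
    return (state - delete) | add


# Ground instance of loosen for the hypothetical objects nut1 and hub1.
LOOSEN = {
    "pre": frozenset({("have", "wrench"), ("tight", "nut1", "hub1"), ("on-ground", "hub1")}),
    "add": frozenset({("loose", "nut1", "hub1")}),
    "delete": frozenset({("tight", "nut1", "hub1")}),
}

initial = frozenset({("have", "wrench"), ("tight", "nut1", "hub1"), ("on-ground", "hub1")})
after = apply_strips(initial, LOOSEN["pre"], LOOSEN["add"], LOOSEN["delete"])
```

Note that `("have", "wrench")` and `("on-ground", "hub1")` survive only because of default persistence, which is the behaviour that OCLh transitions do not have.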

Problems  for  a  planner  to  solve  are  posed  in  PDDL  as  an  initial  state  (a  set  of  ground 
literals)  and  a  goal  condition.  Although  the  current  PDDL  version  includes  many  other 
features  (hierarchically-defined  operators,  domain  axioms,  safety  constraints,  quantifi¬ 
cation  over  parameter  domains  etc)  the  majority  of  domain  encodings  and  test-sets 
available via the internet use the simple form of PDDL similar to that described above.

3  The  Object-Centred  Language  OCLh 

OCLh  was  designed  to  be  a  kind  of  ‘lifted’  STRIPS-language,  aimed  to  keep  the  gener¬ 
ality  of  classical  planning  but  to  incorporate  a  model-building  method  and  be  structured 
to  help  the  validation  and  operationalisation  of  domain  models  [12,6,8].  An  OCLh 
world  is  populated  with  dynamic/static  objects  grouped  into  sorts2.  Each  dynamic  ob¬ 
ject  exists  in  one  of  a  well  defined  set  of  states  (called  ‘substates’),  where  these  substates 
are  characterised  by  predicates.  On  this  view  the  application  of  an  operator  will  result 
in  some  of  the  objects  in  the  domain  moving  from  one  substate  to  another.  In  addi¬ 
tion  to  describing  the  actions  in  the  problem  domain,  OCLh  provides  information  on 
the  objects,  their  sort  hierarchy  and  the  permissible  states  that  the  objects  may  be  in. 
Relations  and  propositions  are  not  fully  independent  entities  -  rather  they  now  belong 
to  collections  that  can  be  manipulated  as  a  whole.  So  instead  of  dealing  with  literals 
planning  algorithms  reason  with  objects.  Similarly  to  a  typed  PDDL  specification,  the 
objects  and  the  sorts  they  belong  to  are  predefined,  as  is  the  sort  of  each  argument  of 
each  predicate  in  the  OCLh  model. 

An object description in a planning world is specified by a tuple (s, i, ss), where s
is a sort identifier, i is an object identifier of sort s, and ss is its substate. A substate is
a set of predicates which all describe i. For example, (nut, nut0, [loose(nut0, hub1)])
is an object description meaning that nut0 of sort nut is loosely done up on hub1. Or
again, (container, trunk1, [closed(trunk1), locked(trunk1)]) is an object description
meaning that container trunk1 is closed and locked. Only a restricted set of predicates
is allowed to describe an object and appear in its substate. Substates operate
under a closed world assumption local to this restricted set - thus in the last example,
the predicate open(trunk1) is false because (a) it is used to describe objects of this

2 we use the name ‘sorts’ rather than ‘object classes’ to emphasise that OCLh is an abstract
object-centred modelling language - in contrast to an OO implementation language
sort and (b) it does not appear in the substate. The domain modeller defines the predicates
used  to  describe  objects,  and  the  form  of  each  substate,  using  substate  class  defini¬ 
tions.  The  predicate  expressions  in  such  definitions  are  constructed  to  form  a  complete, 
disjoint  covering  of  the  space  of  substates  for  objects  of  each  sort.  When  fully  ground, 
an  expression  from  a  substate  class  definition  forms  a  legal  substate.  For  example,  the 
substate  class  definitions  for  the  sorts  container,  nut  and  hub  are3: 


substate_classes(container, C,
    [[closed(C)], [open(C)], [closed(C), locked(C)]])
substate_classes(nut, N, [[loose(N,H)], [tight(N,H)], [off_hub(N)]])
substate_classes(hub, H, [[on_ground(H), fastened(H)],
    [jacked_up(H,J), fastened(H)], [unfastened(H), jacked_up(H,J)],
    [free(H), jacked_up(H,J), unfastened(H)]])


The first example means that objects of sort container can be either closed, not
open and not locked, or open, not closed and not locked, or closed, locked and not
open (or here is exclusive). Thus, in OCLh negation is implicit: if it is the case that
¬open(trunk1), then this means that trunk1 must be in one of two substates, its object
description being (container, trunk1, [closed(trunk1)]) or (container, trunk1,
[closed(trunk1), locked(trunk1)]).
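
The local closed world assumption can be sketched as follows. This is an illustration only: substates are simplified to lists of unary predicate names for a single object, and the function names are hypothetical rather than part of OCLh's tool set.

```python
# Substate classes for the container sort, simplified to predicate names.
SUBSTATE_CLASSES = {
    "container": [["closed"], ["open"], ["closed", "locked"]],
}


def owned_predicates(sort):
    """All predicates allowed to describe objects of the given sort."""
    return {p for substate in SUBSTATE_CLASSES[sort] for p in substate}


def holds(sort, substate, predicate):
    """Local closed world: an owned predicate is true only if it is listed."""
    if predicate not in owned_predicates(sort):
        raise ValueError(f"{predicate} does not describe sort {sort}")
    return predicate in substate


def substates_where_false(sort, predicate):
    """The legal substates an object may be in when the predicate is false."""
    return [ss for ss in SUBSTATE_CLASSES[sort] if not holds(sort, ss, predicate)]


not_open = substates_where_false("container", "open")
```

Here `not_open` lists the two substates named in the text for an object such as trunk1 when open is false.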

A  domain  model  is  built  up  in  OCLh  by  creating  the  operator  set  at  the  same  time 
as  creating  the  substate  class  definitions.  We  define  an  object  expression  to  be  a  tuple 
(s, i, se) such that the expression part se is a generalisation of one or more substates
(se is normally a set of predicates containing variables). We define an object transition to be
an expression of the form (s, i, se => ssc) where i is an object identifier or a variable
of sort s, (s, i, se) forms a valid object expression, and ssc is taken from one of the
substate class definitions. Thus when ssc is ground it will always be a valid substate.
An  action  in  a  domain  is  represented  by  an  operator  schema  with  an  identifier,  a  prevail 
condition,  and  a  list  of  transitions.  Each  expression  in  the  prevail  condition  must  be  true 
before  execution  of  the  operator,  and  will  remain  true  throughout  operator  execution. 

Two  OCLh  operators  hand-crafted  to  (loosely)  correspond  to  the  PDDL  operators 
above  are  as  follows: 

operator(loosen(N,H,W),
    [(wrench, W, [have(W)]), (hub, H, [on_ground(H), fastened(H)])],
    [(nut, N, [tight(N,H)] => [loose(N,H)])])
operator(fetch(T,C),
    [(container, C, [open(C)])],
    [(tool_or_wheel, T, [in(T,C)] => [have(T)])])

As  with  PDDL,  OCLh  has  many  other  features  such  as  conditional  operators,  hier¬ 
archical  operators,  atomic  and  general  invariants,  but  due  to  lack  of  space  we  refer  the 
reader  to  the  literature  for  these  details. 
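
As a rough illustration of how a ground OCLh operator of this form acts on a world, the sketch below checks the prevail condition and the left hand sides of the transitions, then replaces each changed object's substate wholesale. The data layout (a dictionary from object identifier to a set of predicate strings) and the function name are assumptions made for the example, not OCLh's implementation.

```python
def apply_operator(world, prevail, transitions):
    """Apply a ground operator to a world mapping object ids to substates.

    prevail:     list of (object, expression) pairs that must hold and persist.
    transitions: list of (object, lhs, rhs) triples; rhs replaces the substate.
    """
    for obj, expr in prevail:
        if not expr <= world[obj]:
            raise ValueError(f"prevail condition fails for {obj}")
    for obj, lhs, _ in transitions:
        if not lhs <= world[obj]:
            raise ValueError(f"transition not applicable to {obj}")
    new_world = dict(world)
    for obj, _, rhs in transitions:
        new_world[obj] = rhs  # no default persistence inside a substate
    return new_world


# Ground instance fetch(wrench, trunk1) in a hypothetical two-object world.
world = {
    "trunk1": frozenset({"open(trunk1)"}),
    "wrench": frozenset({"in(wrench,trunk1)"}),
}
fetch_prevail = [("trunk1", frozenset({"open(trunk1)"}))]
fetch_transitions = [
    ("wrench", frozenset({"in(wrench,trunk1)"}), frozenset({"have(wrench)"})),
]
result = apply_operator(world, fetch_prevail, fetch_transitions)
```

Only the wrench changes substate; the trunk appears in the prevail condition and is left untouched.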


3  whereas  in  PDDL  we  write  a  variable  as  an  identifier  beginning  with  ‘?’,  in  OCLh  variables 
are  identifiers  with  leading  capitals 
4 Lifting PDDL To OCLh

The general framework: We base the translation on two main assumptions: (1) the
input  to  the  translator  will  be  any  model  written  in  the  subset  of  PDDL  that  includes 
STRIPS-like  operators  with  literals  having  typed  arguments.  This  translator  will  be  ad¬ 
equate  for  our  purposes  as  the  test  sets  in  use  in  the  AIPS  competitions  and  those  avail¬ 
able  in  resource  web  sites  are  generally  no  more  expressive.  Where  the  model  is  written 
without  typed  arguments,  it  can  be  augmented  by  hand  or  a  tool  such  as  TIM  [5]  can  be 
used  to  provide  the  typing.  (2)  the  translation  should  keep,  as  far  as  possible,  the  names 
and  structure  of  the  input  model.  This  leads  us  to  the  following  general  framework  for 
translation: 

PDDL  parameter  type  name  =>  OCLh  sort  name 

PDDL  predicate  =>  OCLh  predicate 

PDDL operator name => OCLh operator name

The  first  association  preserves  the  correspondence  of  PDDL  primitive  types  and  OCLh 
sorts  but  does  not  guarantee  conformity  to  the  OCLh  requirements  for  a  sort  hierarchy 
as  PDDL’s  sort  hierarchy  is  not  required  to  form  a  tree  structure.  This  problem  can 
however  be  ignored  for  the  STRIPS-like  domains  we  are  interested  in  here  as  non¬ 
primitive  types  are  just  a  device  to  allow  single  operators  to  describe  transformations  to 
objects  of  diverse  types.  The  second  association  raises  problems  concerning  the  issue  of 
grouping  predicates  to  form  substate  class  definitions  as  discussed  below.  Once  this  is 
done,  re-writing  the  PDDL  operators  by  extracting  the  object  transitions  and  the  prevail 
clauses  from  the  raw  STRIPS  operator  is  relatively  straightforward. 

Inducing Substate Classes: Steps in the OCLh method that are used to derive
substate  class  definitions  are  as  follows: 

1.  Identify  the  sorts  that  are  dynamic  and  those  that  are  static 

2.  For  each  dynamic  sort,  identify  those  predicates  that  are  to  be  included  in  defining  its 
substate  classes 

3.  For  each  dynamic  sort,  define  its  substate  classes 

For  step  1,  a  sufficient  condition  for  a  sort  s  to  be  dynamic  is  that  PDDL  type  s  is  de¬ 
scribed  by  a  property  which  can  be  changed  by  a  PDDL  operator.  Those  types  that  have 
no  changeable  properties,  but  are  referred  to  within  a  changing  relation  may  or  may  not 
be  mapped  to  a  dynamic  sort  -  this  choice  will  become  clear  after  our  discussion  of  step 
2. In step 2, the problem alluded to above arises: given a predicate p(s1, s2, ..., sn),
which subset of the OCLh sorts s1, s2, ..., sn should it be associated with, to describe those
sorts' substate classes? In the method associated with OCLh it is proposed that normally
each predicate describe a single sort (although if the sort were not primitive the
predicate  would  be  used  in  distinct  primitive  sorts).  To  illustrate  this  problem  consider 
the  PDDL  predicate  in,  with  two  arguments  of  type  tool  and  container  respectively. 
Both  types  are  mapped  over  to  OCLh  dynamic  sorts,  and  the  question  arises:  should 
the predicate “in” be used to describe the state of an object of sort tool, the state of
an  object  of  sort  container,  or  both?  Though  from  a  logical  point  of  view  there  is 
no  more  reason  to  say  that  the  predicate  in  characterises  a  tool  than  there  is  to  say  it 
characterises  a  container  there  are  strong  pragmatic  reasons  to  classify  the  predicate  as 
belonging to only one of the objects referenced. If we allow a predicate to describe the
states of all its argument sorts then there is a clear redundancy in our representation, in that we record
the same information twice. More serious than this, allowing a relational predicate to
characterise  all  referenced  sorts  introduces  the  frame  problem  in  a  particularly  acute 
manner.  Recall  that  the  right  hand  sides  of  OCLh  transitions  must  fully  characterise 
the  resulting  substate  of  the  dynamic  object  participating  in  the  transition,  without  de¬ 
fault  persistence  but  with  a  closed  world  assumption  local  to  the  predicates  describing 
that  sort.  Then  to  record  the  possible  substates  of  the  container  we  would  have  to  con¬ 
sider  the  possible  combinations  of  the  container  being  open,  closed  and  locked  along 
with  all  possible  combinations  of  objects  such  as  the  tools  and  wheels  being  either  in 
or  not  in  the  container;  this  would  lead  to  a  proliferation  of  object  transitions  and  op¬ 
erators.  The  discussion  above  shows  that  it  is  not  practicable  to  let  a  predicate  be  used 
in  the  substate  descriptions  of  objects  in  every  one  of  its  argument  sorts.  Our  solution 
to  this  frame  problem  is  to  try  to  follow  the  intuition  in  building  an  OCLh  model  man¬ 
ually:  let  the  algorithm  choose  one  single  sort.  This  distinguished  sort  is  said  to  own 
the  predicate.  Though  from  a  logical  point  of  view  this  may  seem  arbitrary  it  coincides 
with  intuition  in  the  sense  that  we  would  not  naturally  think  of  the  action  of  opening 
the  trunk  as  having  a  different  result  depending  on  the  trunk’s  contents.  In  English  an 
action  verb  is  typically  thought  of  as  characterising  the  subject  of  the  sentence  rather 
than the object. In this spirit we say that the predicate in(wrench, trunk) describes the
state  of  the  wrench  but  not  that  of  the  trunk.  Given  that  we  will  only  allow  a  predicate 
to  characterise  a  single  sort  the  choice  of  sort  could  be  made  in  a  number  of  competing 
ways.  We  could  try  to  allocate  predicates  to  sorts  in  a  way  to  try  and  minimise  the  frame 
problem  or  to  minimise  the  number  of  sorts  that  change  state  in  the  actions  concerned, 
or  we  could  simply  allocate  them  to  the  first  mentioned  object  in  each  predicate.  Up  to 
now  our  experiments  have  shown  the  third  strategy  gives  satisfactory  results  when  the 
auto-generated  OCLh  model  has  been  compared  to  a  hand  crafted  version.  Returning 
to  step  1,  this  analysis  determines  the  split  of  static  and  dynamic  sorts:  if  some  dynamic 
predicate  has  the  property  that  its  first  argument  can  contain  object  identifiers  of  sort  s, 
then  s  is  a  dynamic  sort;  otherwise,  s  will  not  be  described  by  any  dynamic  predicates 
and  hence  will  be  static. 
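
The third strategy (the first argument owns the predicate) and the resulting split into dynamic and static sorts can be sketched as below. The predicate table is a hypothetical fragment of the Tyre World with simplified typing (wheel is omitted), and `changed` stands for the set of predicates that appear in some operator effect; neither is taken from the translator's actual code.

```python
# Predicate name -> argument types (hypothetical, simplified Tyre World fragment).
PREDICATES = {
    "in": ("tool", "container"),
    "have": ("tool",),
    "open": ("container",),
    "tight": ("nut", "hub"),
    "loose": ("nut", "hub"),
    "on-ground": ("hub",),
}


def owner(predicate, predicates=PREDICATES):
    """The sort of the first argument owns the predicate."""
    return predicates[predicate][0]


def split_sorts(changed, predicates=PREDICATES):
    """Return (dynamic, static) sorts given the predicates changed by operators."""
    all_sorts = {sort for args in predicates.values() for sort in args}
    dynamic = {owner(p, predicates) for p in changed}
    return dynamic, all_sorts - dynamic


# With only the effects of fetch taken into account, tool is the sole dynamic sort.
dynamic, static = split_sorts({"in", "have"})
```

Under this allocation the container is not made dynamic merely because it is the second argument of in, which is the point of choosing a single owning sort.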

Dealing with Negation: Negation in OCLh is not represented explicitly, because
of  the  local  closed  world  assumption  used  in  substates.  It  may  be  the  case,  however, 
that  a  negative  form  (or  opposite)  is  required.  We  deal  with  this  by  potentially  creating 
for each predicate a negative form, identified by prefixing the predicate name with not_.
Though we start with the availability of all such possible negations, not all are used in
the final translation. This would be the case in the Tyre World if the container/trunk
is described as either open or closed, in which case the negative forms not_open and
not_closed are not needed.


5  The  Translation 

Overall Results: The associations described above have been implemented in a PDDL-OCL
translation tool. The translator has been tested on four typed PDDL worlds: Tyre
World, Ferry, Gripper and Fridge. Details of the algorithm and results, which have been
omitted here due to lack of space, are given in [9].


616  R.M.  Simpson  et  al. 

The  translation  results  are  encouraging  in  that  they  are  close  to  those  produced  by 
hand  translation  from  the  same  PDDL  source.  To  indicate  the  nature  of  the  results,  of 
the  thirteen  actions  in  the  Tyre  World  two  of  the  actions  contained  anomalies  flagged  up 
by  the  translation.  Eight  of  the  translated  actions  contained  unnecessary,  though  correct, 
negations  on  their  right  hand  sides  and  two  actions  had  incomplete  object  transitions. 
In the Ferry world the translated model could be run by a planner after the addition
of a missing object state to the problem specification.

Anomalies: The translation forms a basis for hand completion but also has the
power  to  flag  up  potential  problems  and  insecurities  with  the  PDDL  domain  specifica¬ 
tion.  The  most  interesting  of  the  anomalies  uncovered  in  the  Tyre  World  domain  occurs 
with  the  jack-down  action  which  is  translated  as  follows: 

(:action jack-down
  :parameters (?h - hub)
  :precondition (not (on-ground ?h))
  :effect (and (on-ground ?h) (have jack)))

operator(jack_down(H),
    [],
    [(hub, H, [not_on_ground(H)] => [on_ground(H)]),
     (tool, jack, [] => [have(jack)])])

The  transition  for  the  jack  indicates  that  it  may  be  in  any  state  prior  to  being  pos¬ 
sessed  as  a  result  of  jacking  down  the  wheel.  This  is  not  adequate  as  the  mechanic  only 
possesses  the  jack  after  execution  of  the  action  because  it  was  used  to  jack-up  the  wheel 
in  the  first  place.  The  PDDL  formulation  works  (operationally)  because  in  the  version 
of  the  domain  used  there  is  no  alternative  way  of  getting  the  wheel  off  the  ground  (al¬ 
though  we  might  have  an  alternative  jack-up  action,  such  as  use  a  block  and  tackle). 
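
The shape of this translation can be reproduced with a small sketch. It is not the authors' algorithm, whose details are given in [9]; it assumes that each literal is owned by its first argument, that an object with at least one owned effect gets a transition, and that an object with only owned preconditions goes into the prevail condition. Sort names are omitted for brevity and all function names are hypothetical.

```python
def name(literal):
    """Render a signed literal in OCLh style, using not_ for negation."""
    positive, pred, args = literal
    base = pred.replace("-", "_")
    prefix = "" if positive else "not_"
    return f"{prefix}{base}({','.join(args)})"


def translate(pre, eff):
    """Split a STRIPS operator into prevail expressions and object transitions.

    Literals are (positive, predicate, args) triples; the owner is args[0].
    """
    owners = []
    for _, _, args in pre + eff:
        if args[0] not in owners:
            owners.append(args[0])
    prevail, transitions = {}, {}
    for obj in owners:
        own_pre = [lit for lit in pre if lit[2][0] == obj]
        own_eff = [lit for lit in eff if lit[2][0] == obj]
        if not own_eff:
            prevail[obj] = [name(lit) for lit in own_pre]
            continue
        touched = {lit[1] for lit in own_eff}
        # Preconditions on predicates the effect does not mention are carried over.
        kept = [lit for lit in own_pre if lit[1] not in touched]
        transitions[obj] = (
            [name(lit) for lit in own_pre],
            [name(lit) for lit in own_eff + kept],
        )
    return prevail, transitions


jack_down_pre = [(False, "on-ground", ("H",))]
jack_down_eff = [(True, "on-ground", ("H",)), (True, "have", ("jack",))]
prevail, transitions = translate(jack_down_pre, jack_down_eff)
```

The empty left hand side produced for the jack is exactly the anomaly discussed in the text: the PDDL operator says nothing about the jack's state before the action.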

A  second  anomaly  which  arises  with  the  encoding  of  the  jack-down  action  is  that  we 
treat [on_ground(H)] as a complete substate of the hub. From the auto-generated
substate class definition for the hub we see that either the predicate fastened(H) or
unfastened(H) and either the predicate free(H) or not_free(H) must also apply to the
hub and this raises the following question: should we not make
it  a  precondition  of  the  action  that  the  hub  has  the  wheel  on  and  fastened  to  it  prior  to 
jacking  down?  In  effect,  in  OCLh  terms  the  transition  should  be 

(hub, H, [not_on_ground(H), fastened(H), not_free(H)] =>
         [on_ground(H), fastened(H), not_free(H)])

This  example  raises  a  more  general  problem  with  the  PDDL  representation  of  plan¬ 
ning  domains.  As  noted  OCLh  requires  the  right  hand  side  of  an  object  transition  to 
completely  characterise  the  resulting  state  of  the  object.  In  the  PDDL  representation  the 
state  of  an  object  is  not  fully  determined  by  the  application  of  an  operator  as  some  of  the 
object's properties may simply persist without being referenced by the operator from an
earlier  state.  This  makes  it  impossible  to  determine  which  states  of  objects  are  legal  in 
the domain. In general if we have n predicates characterising an object sort, excluding
their negations, there are 2^n possible substates for objects of that sort. Accordingly, for
example, for the hub described by the predicates (on-ground(H), fastened(H), free(H))


Knowledge  Representation  in  Planning  617 


there  are  eight  possible  such  substates.  From  the  PDDL  operators  some  of  these  states 
may  be  reachable  and  others  not  but  we  cannot  definitively  exclude  any  from  the  sub¬ 
state  class  definition  as  all  may  be  candidate  start  states  for  an  imaginable  problem.  But 
this is inadequate as some of these states may not be possible, such as on_ground(H) ∧
fastened(H) ∧ free(H), i.e. hub on the ground with the nuts fastened up but no wheel on.
Operationally  this  may  not  occur  if  the  hub  starts  off  in  a  sensible  state  but  we  should 
not  be  relying  on  such  a  procedural  definition  of  an  object’s  states  to  determine  what  is 
possible. 
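
The count of 2^n candidate substates can be checked by direct enumeration. The sketch below assumes each predicate is either asserted or replaced by its not_ form, following the naming convention described earlier; the function name is illustrative.

```python
from itertools import product


def candidate_substates(predicates):
    """Enumerate every combination of each predicate or its not_ form."""
    substates = []
    for signs in product([True, False], repeat=len(predicates)):
        substates.append(
            [p if positive else f"not_{p}" for p, positive in zip(predicates, signs)]
        )
    return substates


hub_substates = candidate_substates(["on_ground", "fastened", "free"])
```

The list contains the implausible combination of on_ground, fastened and free alongside the sensible ones, and nothing in the PDDL operators allows the translator to rule it out.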

Implicit agents: A problem with agents of actions being implicit in PDDL domain
specifications  arises  when  translating  to  OCLh .  The  problem  is  amply  illustrated  by 
the  following  translation  of  the  move  rule  from  the  Gripper  domain  where  we  have  a 
robot  that  can  move  from  a  named  location  to  another  named  location. 

operator(move(TO, FROM),
    [],
    [(room, TO, [] => [at_robby(TO)]),
     (room, FROM, [at_robby(FROM)] => [not_at_robby(FROM)])])

The  rooms  TO  and  FROM  are  being  classed  as  dynamic  objects  subject  to  change, 
but  we  would  more  naturally  want  to  say  that  it  is  the  robot  that  has  changed.  In  OCLh 
parlance  locations  should  be  treated  as  static.  To  solve  this  problem  we  need  to  recode 
the at_robby predicate and introduce an agent, i.e. the robot, giving at(Agent, Location). If the
agent has, as in this case, been implicitly encoded into the predicate then there will only
be one such agent and we can effectively introduce a new constant agent0 and a new
type Agent.
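
The recoding step can be sketched as a simple rewrite of literals. This is an illustration under the assumption that a literal is a (predicate, arguments) pair; the function name and the room identifier are hypothetical, while at_robby, at and agent0 come from the text.

```python
def make_agent_explicit(literal, implicit="at_robby", explicit="at", agent="agent0"):
    """Rewrite at_robby(Room) as at(agent0, Room); leave other literals unchanged."""
    pred, args = literal
    if pred == implicit:
        return (explicit, (agent,) + tuple(args))
    return literal


recoded = make_agent_explicit(("at_robby", ("roomA",)))
```

After the rewrite the first argument of the predicate is the agent, so a first-argument ownership rule assigns the predicate to the new Agent sort and the rooms no longer change state.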

6  Discussion 

The  basic  strategies  of  re-casting  domain  knowledge  from  a  predicate  base  into  an 
object-centred  base  are  not  new  and  have  been  discussed  in  the  literature  for  some 
period  (e.g.  see  [11]).  OCLh  contrasts  with  previous  domain  modelling  languages  for 
planning  such  as  SIPE-2  and  O-Plan  [16, 15]  in  its  simplicity  and  clarity.  On  the  other 
hand, OCLh is less sophisticated than these systems (for a comparison of O-Plan's TF
and  OCLh  see  reference  [7]). 

Fox  and  Long  in  reference  [5]  show  that  the  limitation  of  requiring  arguments  to  be 
typed  in  the  PDDL  specification  is  not  fundamental  to  the  translation.  They  demonstrate 
that  type  information  can  be  extracted  from  a  set  of  PDDL  operator  schemae  only.  Fox 
and  Long’s  TIM  uses  the  operator  schemae  to  analyse  the  domain  and  produce  types 
such  that  objects  belonging  to  them  are  identical  up  to  naming.  It  therefore  appears  to 
produce a type structure more appropriate to OCLh. Our future work will involve merging
the translation algorithm with the TIM engine to create a more powerful translation
tool. 

It  has  been  acknowledged  since  the  modern  inception  of  AI  that  the  representa¬ 
tion  of  knowledge  has  a  critical  bearing  on  the  performance  of  a  problem  solver.  In 
planning  especially,  there  have  been  relatively  few  insights  or  research  projects  in  this 
area  -  instead  the  planning  literature  has  tended  to  concentrate  on  the  efficiency  issues 
of planners, or the adequacy of expression of their domain model languages. We see
our  ongoing  work  on  the  translation  from  PDDL  to  OCLh  as  promoting  the  debate  on 
the  relative  merits  of  planning  domain  encodings,  and,  in  time,  the  matching  up  of  ap¬ 
propriate  planner  technology  to  application  domain.  Working  with  a  domain  modelling 
language  such  as  OCLh  gives  opportunities  for  higher  level  domain  validation  with  rich 
tool  support  that  eases  domain  modelling.  Our  translator  from  PDDL  to  OCLh  gives 
us  access  to  a  rich  source  of  research  examples  written  in  PDDL,  although  it  shows 
up  the  lack  of  ’’knowledge  content”  in  these  encodings.  Also,  it  has  highlighted  those 
issues  in  representation  such  as  use  of  negation,  completeness  and  security  of  models, 
and  construction  of  object  hierarchies  that  are  fundamental  to  the  creation  of  a  planning 
domain  model. 
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Abstract.  This  paper  describes  a  method  to  construct  multiagent  sy¬ 
stems.  The  method  proposed  here  is  explored  accounting  for  the  Gib¬ 
son’s  ecological  view  of  information,  i.e.  affordance.  We  apply  the  idea 
of  affordance  not  only  to  the  reactive  models  of  agents  but  also  to  the 
deliberative  models  of  the  agents.  By  this  approach,  we  can  avoid  the 
frame  problems  that  emerge  from  the  dynamic  environment  including 
the  agent’s  mental  world.  We  describe  a  basic  representation  scheme  for 
agent  modules  which  reflects  the  Gibson’s  view  of  information  resources. 
As  a  system  description  language,  we  use  knowledge  processing  langu¬ 
age  KAUS  (knowledge  acquisition  and  utilization  system)  based  on  first 
order  logic  and  axiomatic  set  theory.  We  consider  as  an  example  the 
multi-strata  modeling  scheme  for  developing  human-computer  interac¬ 
tive  problem  solving  systems  that  are  essentially  multiagent  systems. 


1  Introduction 

The more complex and large-scale information systems become, the more
important it is that they have autonomy in information gathering, processing and
management. The agent technology developed so far [1,2,3] has been widely
applied to the design and implementation of such autonomous systems
(autonomous agent systems). Typical applications of agent technology are
found in intelligent robots [1], enterprise models [10] and information
assistants in the Web environment [3].

The common problem in constructing agent systems is how to
describe agent structures (organizations), their functionality, their properties and
their control structures in general. We have to consider especially the autonomy of
the system, that is, the capabilities of learning from the environment through
interaction and of making decisions by itself to attain its goals. In the multiagent
environment, the coordination, cooperation and communication among
individual agents should also be considered. We believe that applying knowledge
processing technology is indispensable for solving these problems.
Another point to be considered about agent systems is the ability of
the agents to perceive information resources. If the environment with which
the agents interact is very complex, large, uncertain and dynamical, it is very
hard, or even impossible, for them to possess all information about it beforehand.
This means that an agent should have a mechanism for automatically finding
the necessary information in the environment. Even more important is how to solve
the frame problem in artificial intelligence.

In this paper, we show that Gibson's ecological view of information resources,
denoted by affordance [7], is valuable for solving the problems described
above. In Chapter 2, we briefly summarize the notion of affordance, the frame
problem in artificial intelligence [6] and a link between them. In Chapter 3, we
describe a basic representation scheme for agent modules which reflects Gibson's
view of information resources. In Chapter 4, we describe a language which
is suitable for implementing agent systems, specifically the KAUS language [11,12]
developed by us. In Chapter 5, we consider as an example the multi-strata modeling
scheme for developing human-computer interactive problem solving systems.
Finally, we give concluding remarks in Chapter 6.

2  A  Link  between  the  Frame  Problem  in  AI  and 
Affordance  in  Psychology 

The frame problem is the problem of describing, computationally, which properties
persist and which properties change as actions are performed. Two frame
problems are discussed in the literature [6]. One is the mathematical
frame problem and the other is the commonsense frame problem. The former is
concerned with the intractability, in time and space, of computing frame axioms.
The latter is concerned with the intrinsic difficulty of axiomatizing and conceptualizing
a significant portion of the real world, a world which is
complex, large, uncertain and dynamical, and ill-defined by nature. The qualification
problem in such circumstances also has to be solved.

On the other hand, the theory of affordance established by the ecological psychologist
James J. Gibson (1904-1979) gives us a new point of view on perception
of  objects  in  the  real  world.  According  to  Gibson,  the  correct  context  of  our 
perception  is  defined  by  the  interaction  between  us  and  the  real  world.  It  differs 
from  the  notion  in  traditional  cognitive  psychology  in  the  sense  that  perception 
and  action  are  not  treated  as  separate  processes  in  affordance  theory.  Organisms 
move  in  the  world,  finding  the  available  information  in  it  and  move  again  using 
the  information  found.  It  is  assumed  that  information  available  emerges  from 
what  is  maintained  in  the  real  world.  An  organism’s  activities  are  said  to  be 
directly  linked  to  what  the  objects  in  the  real  world  afford  (for  example,  a  chair 
affords  sitting). 

To summarize, the frame problem means that it is fundamentally difficult
for practical AI systems to provide all the information necessary for problem
solving in advance and exhaustively. It also implies that centralized
information management is intractable for complex and large-scale problem
domains. On the other hand, the notion of affordance shows us that we can configure
AI systems in such a manner that they maintain information resources
in distributed objects in the problem world and perform on-demand information
processing through interaction with those objects. Concurrent
objects [13], agent-oriented programming [14] and, more broadly, distributed
artificial intelligence (DAI) and multiagent systems (MAS) [15] involve the
potential use of the notion of affordance. Brooks's subsumption architecture [1,
2] in robotics also involves the potential use of the notion of affordance to avoid
the frame problem in AI. However, Brooks's architecture is only concerned
with the reactive level of actions. As a result, the potential use of affordance at
the deliberative level is not considered. In fact, Gibson's original ecological
view of information also relates only to the reactive level of perception. It denies
all mental states, cognitive maps and inference in the organism's brain
(as does Brooks's architecture). Higher levels of intelligence are not
referred to at all. We claim that this restricted treatment of affordance is not
enough to construct truly intelligent systems. In general settings, intelligent systems
should have reactive, cognitive and deliberative capabilities, and
affordance should be applied seamlessly to all of them. Our goal in this paper is thus
to remove this restriction in developing multiagent
systems.

3  Basic  Representation  Scheme  for  Individual  Agents 

As described in the previous chapter, every constituent of the environment, namely
organisms and artifacts, can be regarded as an ecological information mediator
and processor. From the ecological viewpoint of information resources, not
only organisms but also artifacts can be regarded as agents having certain affordances.
Consequently the basic representation scheme for agents described in
this chapter can be applied both to the representation of organic agents and to
that of artificial agents.

3.1  Skeleton  Structure  of  an  Agent  Functional  Module 

An agent functional module (M) is defined by a set of variables holding the
values of specified attributes and a set of primitive/compound methods (procedures)
that achieve a specified goal using the input and the current state
of the defined variables. We define the skeleton structure of an agent functional
module M as follows (see Fig. 1).

(1)  The  name  of  the  module. 

(2)  The  layer  or  type  of  the  module. 

(3)  The  state  of  the  module. 

(4)  The  affordance. 

(5)  A  set  of  methods  for  perception,  action,  and  communication. 

(6) The local working memory.

[Figure 1: a module box with the parts NAME, PERCEPTION, ACTION, COMMUNICATION and WORKING MEMORY]

Fig.  1.  Skeleton  structure  of  an  agent  functional  module 


The name of the module is a unique identifier that distinguishes it
from other agent modules. The state of the agent module holds both the current
state and the past states (history) over a predefined time period. The history
describes the time series of the results of perceptions and actions executed by this
module. The layer or type of the agent module is used for classifying the module
(see the next section). The layers considered here are those of reactive modules,
which pursue reactive goals, mental modules, which perform local reflective
thinking, and social modules, which perform cooperative tasks. A rational agent
is made up of these three layers [1].

Perception, action and communication are component modules that characterize
the agent functionality. The affordance describes the set of attributes and
their values that the agent module can afford to the other agent modules. This set
may be either invariable or variable. A variable set means that some attributes
are added or deleted through interaction with the environment. Some attributes
may denote roles of the module. We want to stress here that the perception
module is a context dependent searcher which searches for the necessary information
among the affordances that are maintained by the other agent modules. In
this sense, the perception module is an active input function, not a passive
one (see footnote 1).

An  action  module  describes  an  output  function  that  changes  the  internal 
state  of  the  agent  module  and  affects  the  external  state  by  executing  an  action 
in  the  environment.  A  communication  module  describes  the  protocol  for  message 
passing  and  the  intention  of  acceptance/rejection  of  messages  to  and  from  the 
other  agent  modules.  The  communication  module  is  triggered  in  the  if-needed 
mode.  The  local  working  memory  is  a  memory  used  only  within  its  agent  module. 
It  is  a  temporary  memory  used  at  the  activation  time  of  this  agent  module. 
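
The six-part skeleton above can be sketched in Python. This is only an illustrative rendering, not the authors' KAUS implementation: the class name `AgentModule`, its field names and the `accepts` attribute are assumptions introduced here. The sketch shows perception as an active search over the affordances of other modules, and an action that changes only the acting module's own state.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class AgentModule:
    """Illustrative skeleton of an agent functional module (cf. Fig. 1)."""
    name: str                                                      # (1) unique identifier
    layer: str                                                     # (2) "reactive", "cognitive" or "social"
    state: List[Tuple[int, str]] = field(default_factory=list)     # (3) history of (time, state)
    affordance: Dict[str, Any] = field(default_factory=dict)       # (4) attributes offered to others
    working_memory: Dict[str, Any] = field(default_factory=dict)   # (6) local scratch space

    # (5) methods for perception, action and communication
    def perceive(self, others, wanted):
        """Actively search the affordances of the other modules for `wanted`."""
        for other in others:
            if other is not self and wanted in other.affordance:
                self.working_memory[wanted] = other.affordance[wanted]
                return other.name, other.affordance[wanted]
        return None

    def act(self, time, new_state, offered=None):
        """Record a new internal state and optionally extend what is afforded."""
        self.state.append((time, new_state))
        if offered:
            self.affordance.update(offered)

    def communicate(self, receiver, message):
        """If-needed message passing; the receiver decides to accept or reject."""
        return receiver.accept(self.name, message)

    def accept(self, sender, message):
        accepted = message in self.affordance.get("accepts", ())
        if accepted:
            self.working_memory["last_message"] = (sender, message)
        return accepted


chair = AgentModule("chair1", "reactive", affordance={"sitting": True})
robot = AgentModule("robot1", "reactive", affordance={"accepts": ("status",)})
found = robot.perceive([robot, chair], "sitting")   # search what the others afford
robot.act(1, "seated")                              # only the robot's own history changes
```

In this sketch the robot holds no prior model of the chair; it finds the `sitting` affordance only when it looks for it, and the chair's state is untouched by the interaction.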

To conclude, the frame problems described in the previous chapters can be
avoided by making cooperative use of perception and communication modules.
That is, an individual agent need not know everything about its
environment; it is only required to interact with it and perceive (search for) the
necessary information which is afforded by the other agent modules. The result of
this interaction affects only the states of the agents involved in the interaction
and does not affect the states of the other agents.

1 In constraint logic programming, active constraints and passive constraints are considered.
The author believes that Gibson's view of perception corresponds to
active constraint solving in constraint logic programming.

3.2  Organizing  Modules 

The skeleton structure of an agent module defined in Sec. 3.1 contains three
functional modules: perception, action and communication. A specific reactive
agent is described by a set of specialized reactive modules. A cognitive agent
is described by a set of specialized reactive modules and cognitive modules.
A social agent is described by a set of reactive modules, cognitive modules
and social modules. A rational agent is either a cognitive agent or a social agent.
These various agents are organized to construct a multiagent system.

[Figure 2: a rational agent whose parts (part-of) are the reactive, cognitive and social modules, linked to the real world (external), the mental world (internal) and the social world (external/internal), respectively]

Fig. 2. Representation of Rational Agent

Fig. 2 shows that a social rational agent is composed of the bottom layer (reactive modules),
the middle layer (cognitive modules) and the top layer (social modules). The activities
of each layer are defined over the object world which is directly linked to
it and also over the adjacent layer or layers shown in the figure. The perception
modules in each layer are context dependent searchers which search for the necessary
information among the affordances maintained in the object world which
is directly linked to that layer. For example, the object world for the reactive layer is
the real world, which is the external world for the rational agent. The perception
modules in the reactive layer perceive the real objects in the real world, and
the agent acts in that world. The object world for the cognitive layer
is the mental world of the rational agent. The perception modules in the cognitive
layer perceive the necessary information from the affordances maintained in
the mental world, and according to the results of perception, the agent
changes its mental state and forms intentions, plans and decisions for actions
to be performed in the environment. The mental world is also maintained
by analyzing its current state.

Agent modules and auxiliary submodules described above are organized in
three abstraction hierarchies, i.e., aggregation, generalization and association.
First, we can define an aggregate of modules in such a way that it exhibits
a collective functionality. Social behavior in a multiagent system
is a typical example. Second, the generalization (classification) of modules,
whether they are primitive or compound, is also considered by taking account of
their functionality, i.e., perception modules, action modules, etc. Third, we can
define an association of aggregates in such a way that it exhibits a collective
group functionality. A typical example is seen in coordination
and cooperation among different task groups of agents. A selection and scheduling
problem for different plans, i.e., selecting a set of appropriate goals and then
scheduling the selected goals, is another example. Fig. 3 summarizes
these three abstraction hierarchies. As seen in the next chapter, the KAUS language
provides commands for describing these abstraction hierarchies in a coherent
way.


Fig.  3.  Organizing  modules 

We note here that in a practical implementation we cannot explicitly describe
all members of an association because the cardinality of the association
set is exponential (2^n). Because of this, in practical applications the members of
an association are given generatively. For example, a candidate set
of plans is generated by the planning module only when required.
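
The generative treatment of associations can be sketched as follows. This is a minimal illustration under assumptions of our own, not part of KAUS: the function names `candidate_groups` and `first_acceptable` and the goal labels are hypothetical. Candidate groups are produced one at a time by a generator, so the 2^n subsets are never stored together.

```python
from itertools import combinations, islice


def candidate_groups(modules, max_size=None):
    """Yield non-empty subsets (candidate associations) lazily, smallest first."""
    n = len(modules)
    limit = n if max_size is None else min(max_size, n)
    for size in range(1, limit + 1):
        for group in combinations(modules, size):
            yield group


def first_acceptable(modules, acceptable):
    """Return the first generated group that satisfies `acceptable`, else None."""
    for group in candidate_groups(modules):
        if acceptable(group):
            return group
    return None


goals = ["g1", "g2", "g3", "g4"]
total = 2 ** len(goals)                       # size of the power set, empty set included
first_three = list(islice(candidate_groups(goals), 3))
pair = first_acceptable(goals, lambda g: "g2" in g and "g4" in g)
```

The search for `pair` stops as soon as a suitable group is generated, which is the point of describing the association generatively rather than by enumeration.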

3.3  Execution  Control  of  Modules 

As for the execution of agent modules and the associated submodules, sequential,
parallel (concurrent), synchronous and asynchronous executions are considered.
The autonomy of the control emerges from each agent's interaction with the
environment. To put it another way, the affordances described in the agent
modules make it easy to produce such control dynamics: one agent module
perceives the other agent module's affordances and then reacts or makes a deliberative
decision for the next action, and vice versa. The frame problems would
not occur in such autonomous control systems.


4  KAUS  Language  as  a  System  Description  Language 


We have developed the KAUS (Knowledge Acquisition and Utilization System) language
motivated by the view that a language based on sets and logic would be advantageous
for modelling intelligent systems. For example, in intelligent design
systems based on multiagent technology, we are required to describe both the
structures and the functionality of the designed objects and the participating agents.

Using KAUS we can describe object structures in terms of set hierarchies (the isa,
part-of and group-of hierarchies mentioned in 3.2). Agents can also be organized
in the set hierarchy. On the other hand, we describe their functionality and
relations in terms of an extended predicate logic in which primitive procedural
functions are incorporated. Regarding sets as types of variables, we can write
typed logic programs in KAUS for problem solving. The following shows the
syntax of the KAUS language [15].


CLAUSE          =  LITERAL | PREFIX LITERAL | AND-OR_FORMULA | PREFIX AND-OR_FORMULA
LITERAL         =  ATOMIC_FORMULA | ~ATOMIC_FORMULA
PREFIX          =  [QUANTIFIER VARIABLE C_FLAG / TYPE] |
                   [QUANTIFIER VARIABLE C_FLAG / TYPE] PREFIX
AND-OR_FORMULA  =  (CONNECTIVE BODY BODY ...)
ATOMIC_FORMULA  =  (PREDICATE_NAME ARGS)
BODY            =  LITERAL | AND-OR_FORMULA
TYPE            =  BASE_SET | POWERSET | COMPONENT_SET | NAMELESS_SET | VARIABLE
ARGS            =  TERM | TERM ARGS | TERM, ARGS
PREDICATE_NAME  =  NTA_NAME | PTA_NAME
NTA_NAME        =  BASE_SET
PTA_NAME        =  $NAME
QUANTIFIER      =  A | E
C_FLAG          =  ? | # | ?#
CONNECTIVE      =  [illegible in source]


A clause is used both for asserting a rule or a fact in the rule base and for
describing a goal to be resolved by the inference system. An assertion clause
ends with a period (.) and a query clause ends with a question mark (?). For
example, the rational agent shown in Fig. 2 is represented as follows.

!ins_e rational_agent agent1;        |* define agent1 is-a rational_agent. *|
!ins_e agent1:reactive_mod rmod1;    |* agent1 has a reactive_mod rmod1. *|
!ins_e agent1:cognitive_mod cmod1;   |* agent1 has a cognitive_mod cmod1. *|
!ins_e agent1:social_mod smod1;      |* agent1 has a social_mod smod1. *|

[A Agents/*rational_agent][A Agent/Agents][A SocialActivity/socialActivity]
(act Agent SocialActivity).          |* all members of rational agents perform a social
                                        activity. *|


Note that in the above example expressions, *rational_agent denotes the
powerset of rational agents. Hence the variable Agents is declared as a variable
of a set type.
The KAUS inference system is composed of the modules shown in Fig. 4.
Among these, the elimination of tautology, type check and world operations
characterize the KAUS inference system. To our knowledge, standard Prolog
and its extended versions do not have these inference facilities.


[Figure 4: inference modules of KAUS: SLD derivation, Elimination of Tautology, Negation as failure, Delaying Evaluation, Constraint Solver, Type Check, Occur Check, World Operations. The SLD derivation panel shows a goal <- B, C1, C2, ..., Cm resolved against a clause with body A1, A2, ..., An to give <- A1, A2, ..., An, C1, C2, ..., Cm. The Elimination of Tautology panel shows a goal over A1, A2, A3, A4 reduced to <- A2, A3, A4; its details are not recoverable from the source.]

SLD: Linear resolution for Definite clauses with Selection function

Most general unifier (mgu) for asserted term1 X and query term2 Y in p(X) and p(Y):

term1 (and)    term2      (are unified with) mgu    if
∀X/t1          ∀Y/t2      ∀Z/t2                     t1 ⊇ t2
∀X/t1          ∃Y/t2      ∃Z/t1∩t2                  t1∩t2 ≠ ∅
∃X/t1          ∃Y/t2      ∃Z/t1                     t1 ⊆ t2
∀X/t1          a          a                         a ∈ t1
a              ∃Y/t2      a                         a ∈ t2
a              b          b                         a = b

An example: given [∀X/boy][∃Y/girl]like(X,Y), person ⊇ boy, person ⊇ girl and john ∈ boy,
then [∃Z/person]like(john,Z)

Fig.  4.  Inference  modules  of  KAUS 
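
The unification rules for typed, quantified variables in Fig. 4 can be illustrated with a small Python sketch. It is an assumption-laden reading of the figure, not the KAUS type checker: the class `Var`, the function `unify`, the use of finite Python sets as types and the individuals other than john are introduced here, and combinations not listed in the figure are simply rejected.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union


@dataclass(frozen=True)
class Var:
    quantifier: str            # "A" (universal) or "E" (existential)
    domain: FrozenSet[str]     # the set acting as the type of the variable


Term = Union[str, Var]


def unify(asserted: Term, query: Term) -> Optional[Term]:
    """Return the unifier of an asserted term and a query term, or None."""
    a_var, q_var = isinstance(asserted, Var), isinstance(query, Var)
    if not a_var and not q_var:                      # constant against constant
        return asserted if asserted == query else None
    if a_var and not q_var:                          # asserted variable, query constant
        if asserted.quantifier == "A" and query in asserted.domain:
            return query
        return None
    if not a_var and q_var:                          # asserted constant, query variable
        if query.quantifier == "E" and asserted in query.domain:
            return asserted
        return None
    t1, t2 = asserted.domain, query.domain
    pair = (asserted.quantifier, query.quantifier)
    if pair == ("A", "A") and t1 >= t2:              # t1 contains t2
        return Var("A", t2)
    if pair == ("A", "E") and t1 & t2:               # non-empty intersection
        return Var("E", t1 & t2)
    if pair == ("E", "E") and t1 <= t2:              # t1 contained in t2
        return Var("E", t1)
    return None                                      # case not covered by the figure


boy = frozenset({"john", "tom"})
girl = frozenset({"mary", "sue"})
person = boy | girl

# Asserted: [A X/boy][E Y/girl] like(X, Y).  Query: [E Z/person] like(john, Z)?
x_binding = unify(Var("A", boy), "john")
y_binding = unify(Var("E", girl), Var("E", person))
```

Both argument positions unify, so the query succeeds in this sketch; the second binding stays restricted to the narrower type `girl`.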

In the next chapter we address the problem of implementing intelligent agent
systems using the multi-strata modelling scheme [8,9,10].

5  Representation  of  Multi-strata  Models  by  KAUS 

The multi-strata model is a new modelling scheme for developing human-computer
interactive problem solving systems. For example, business models in enterprises
and many kinds of design project models for developing new products
are instances of the multi-strata model. Fig. 5 shows the skeleton structure of
the multi-strata model and its KAUS description. In the figure, S1, S2 and S3
are called subjects, each of whom undertakes a subtask of problem solving. A
subject may be either a human or a computer. For example, S3's task is to make
the object model (denoted by S3:model) of the object S3:obj. S3:obj is
explicated by the lower level subject S2. The subject S2 in turn undertakes its
own task specified by S2:obj, and makes the model of it. This process continues
until the lowest level of the subject's task is clarified.

We can describe the functionality of each agent (subject) in the multi-strata
model using the !define_agent command, which implements the agent module
given in Section 3.1. For example, the subject S2 is described as follows:

!ins_e rational_agent s2;   |* s2 is-a rational agent. *|
!ins_e human s2;            |* s2 is-a human. *|

!define_agent s2 {
    |* attribute declarations *|
    (privateData name("Jack Jones"), age(32), sex(male)).

    |* body of s2 *|
    (s2 type:designer state([<0,idle>]),
        afford([design(s1:Requirement Model), eval(Model, Status)]),
        perception(getValue(s1, Obj)),
        action(putValue(Obj, s1 Requirement)),
        communication(inform(Status, Agent))
    ).

    |* local rules maintained by s2 *|

};


[Figure 5: left panel "Multi-Strata Model"; right panel "KAUS internal representation", containing the listing below]

!ins_e s1:obj o1;
!ins_e s1:model m1;
!ins_e s2:obj s1;
!ins_e s2:model m2;
!ins_e s3:obj s2;
!ins_e s3:model [value missing in source]


Fig. 5. Multi-strata model (skeleton structure)

6  Concluding  Remarks 

We have described a representation scheme for agent modules which can serve
as a skeleton for real application modules in multiagent systems. The main point of
this paper is that we have applied affordance to such higher levels of intelligent
activity as cognition and deliberative thinking. We have shown that
it is possible to avoid the frame problems existing in the three layers by embedding
the notion of affordance into each agent module allocated to the
reactive, cognitive and social layers, respectively.

Another point is that we have shown that the KAUS language is suitable for implementing
layered systems. However, we have not clarified the details of the control
structures of the multiagent systems built up from the described modules.
This issue should be addressed in future work.


Abstract. In this paper we consider the problem of approximating an
underlying distribution by one derived from a dependence polytree. This paper
proposes a formal and systematic algorithm, which traverses the undirected tree
obtained by the Chow method [2] and, by using independence tests, successfully
orients the polytree. Our algorithm applies the Depth
First Search (DFS) strategy to multiple causal basins. The algorithm has been
formally proven and rigorously tested on synthetic and real-life data.


1. Introduction and Background

Over the last decade Bayesian learning principles have received a fair amount of attention.
Although they are elegant, they usually involve summations or integrals over
all possible instantiations of the parameters and over all possible models. In the case
of learning Bayesian networks (which is distinct from Bayesian learning itself) this
can be perceived as a discrete optimization problem [5]. Precise solutions can
be obtained by search if we assume that there are only a few relevant models.
This has proven to be the method of choice in many real-life applications [1].

Many of the Bayesian models that have been studied are intractable. The challenge is
to find general-purpose, tractable approximation algorithms for reasoning with these
elegant and expressive stochastic models. For example, if we are to use Bayesian
learning to improve the performance of distributed database applications where there can
be millions of transactions every day, we will need an efficient technique to build a
model of the use of the database. The belief network that underlies the Bayesian
learning is at the heart of the approach. The connection between Bayesian learning and
belief networks is that one can use Bayesian techniques to induce a belief network,
referred to as a Bayesian Belief Network (BBN). Often, due to the lack of domain
knowledge and in the interest of simplicity, it is assumed that the underlying structure
has a particularly simple form, representing reciprocal independence of the variables
involved. This results in a simple variant of Bayesian learning called Naive Bayes.
Learning benefits if a more comprehensive and causal model of the interaction
between the variables is available. Such a model, represented as a Bayesian network,
plays the role of a restricted hypothesis bias [9]. The method allows us to
approximate the probability distribution P(X) by a well-defined and easily computable
density function Pa(X). Indeed, it is impractical to store all estimates of the joint function
P(X) for all possible values of the vector X. Our goal is to build a probabilistic
network from the distribution of the data, which adequately represents it. Once constructed,
such a network can provide insight into the probabilistic dependencies that exist
between the variables.

In order to measure the "goodness" of the approximation, an information theoretic
measure can be specified in terms of the Kullback-Leibler [8] cross-entropy metric to
compare joint probability distributions. Chow and Liu [2] used this measure to approximate
discrete distributions by collecting all the first and second order marginals.
They derived a relationship between the measure of closeness between the probabilities
and the measure of independence between all the pairs of the variables. A
Maximum Weight Spanning Tree (MWST), called the "Chow Tree", was built using
the information measure between the variables forming the nodes of the tree. An alternative
method of obtaining such a tree using the χ² metric was later proposed by
Valiveti and Oommen [17]. Subsequent work by Rebane and Pearl [15] used the
Chow Tree as the starting point of an algorithm which builds a polytree (singly connected
network) from a probability distribution. This algorithm orients the Chow tree
by assuming the availability of independence tests on various multiple parent nodes.
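
The construction of the Chow Tree can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by any of the cited authors: it assumes discrete samples, takes empirical mutual information as the information measure, and uses Kruskal's algorithm for the spanning tree (the text does not say which spanning tree algorithm was used). The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(samples, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(samples)
    pi = Counter(row[i] for row in samples)
    pj = Counter(row[j] for row in samples)
    pij = Counter((row[i], row[j]) for row in samples)
    return sum((c / n) * log(c * n / (pi[a] * pj[b])) for (a, b), c in pij.items())


def chow_liu_tree(samples):
    """Return the edges of a maximum weight spanning tree over the variables."""
    d = len(samples[0])
    weighted = sorted(((mutual_information(samples, i, j), i, j)
                       for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))            # union-find forest for Kruskal's algorithm

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:                   # edge (i, j) joins two separate components
            parent[ri] = rj
            edges.append((i, j))
    return edges


# Toy data from a chain X0 -> X1 -> X2 in which each link copies its parent
# with probability 3/4 (counts out of 32 samples).
counts = {(0, 0, 0): 9, (0, 0, 1): 3, (0, 1, 1): 3, (0, 1, 0): 1,
          (1, 1, 1): 9, (1, 1, 0): 3, (1, 0, 0): 3, (1, 0, 1): 1}
samples = [row for row, c in counts.items() for _ in range(c)]
tree = chow_liu_tree(samples)
```

On this data the two direct links carry more mutual information than the pair (X0, X2), so the undirected skeleton of the chain is recovered; orienting it is the separate step discussed in the text.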

The work of Rebane and Pearl [15] is commendable. Although, as we shall see,
they did not answer all the questions regarding polytrees, their inference and their
characterization, in our opinion their work was pioneering (i.e., with regard to polytree
representations) and represented a quantum jump since the work done on trees in the
late 1960s. In our opinion, their most fundamental contribution was to discover and
utilize the edge "orienting" principle [18] referred to later.

Numerous authors have built on the foundation of the work of Rebane and Pearl.
Noteworthy are the results of Srinivas et al. [16], who worked with independence, and
the recent results of Dasgupta [3], which explicitly specify the complexity of the
underlying problem. Friedman has also worked in the area and has modified the traditional
EM algorithm to devise the "Structural" EM algorithm [4] to learn BBNs, and
has also demonstrated how one can learn Bayesian Networks from massive data sets using
the "Sparse Candidate Algorithm" [16].

Much  of  the  current  work  has  made  substantial  progress  on  learning  the  structure  of 
multiply  connected  networks  and  dynamic  Bayesian  networks,  and  this  has  even  been 
achieved  in  the  presence  of  hidden  variables  and  for  real  data  sets  for  which  perfect 
independence  tests  are  not  realistic. 

This paper deals with the problem of automatically building a belief network in
terms of a directed polytree with the assumption that the observations have been presented
to the system in terms of joint probability distributions. Thus we assume that
the joint dependence relations represented by these observations are available. Pearl
[14] discussed this process using a two-phase dependence learning scheme. Our aim is
to find causal polytree structures that fit the data presented in terms of joint probability
distributions. The question of inferring the polytree structure from the data (as opposed
to the data distribution) is the subject of a subsequent paper presently being compiled
and is described in [11].

The reader should observe that polytrees represent much richer dependency models
than undirected trees, because their joint probability density functions are products of
higher order distributions. Consequently, the problem itself can be shown to be a much
harder problem than that of finding the best tree [14]. First of all, the algorithm is not
guaranteed to find a polytree structure if the underlying distribution is degenerate and
not of a polytree type, i.e., if the distributions do not fit into a polytree
representation. Secondly, the algorithm relies on the repeated use of independence
tests that determine categorically whether two random variables X_i and X_j are statistically
independent. As shown in [12], even if the random variables are statistically
independent, the experimental evaluation may never yield conclusive independence
decisions.


1.1  Problem  Statement  and  Outline  of  Solution 

If we consider the polytree construction algorithm as given by [15], we observe that the
order of traversing the tree so as to orient it is unspecified. The implementation strategy
of determining the order of dependence tests is unanswered and left to the reader.
In this paper we shall formally develop an algorithm which answers these questions.
First of all, the algorithm determines the network structure or the tree using the
MWST algorithm described earlier, and subsequently orients the tree by beginning
with the assumption that we have marginal independence between at least two parents
of any node. Thus, if there is no independence between any two parents of a node, the algorithm
will terminate by informing the user that the underlying tree structure cannot be
oriented to yield a polytree.

The problem of orienting the tree is solved in two steps. The first step identifies all
the independencies. In fact, every two nodes X_i and X_j are independent if the following
equality is satisfied: P(X_i, X_j) = P(X_i) * P(X_j).
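
As an illustration of the first step, the following Python sketch checks the product equality for a pair of discrete variables whose joint distribution is given as a table. The function names, the dictionary layout and the tolerance `tol` are assumptions made for this example only; they are not part of the original algorithm, which simply presumes that such independence decisions are available.

```python
from itertools import product


def marginal(joint, axis):
    """Marginal of one variable from a joint table {(x_i, x_j): probability}."""
    out = {}
    for values, p in joint.items():
        out[values[axis]] = out.get(values[axis], 0.0) + p
    return out


def independent(joint, tol=1e-9):
    """True when P(x_i, x_j) = P(x_i) * P(x_j) holds for every pair of values."""
    p_i, p_j = marginal(joint, 0), marginal(joint, 1)
    return all(
        abs(joint.get((a, b), 0.0) - p_i[a] * p_j[b]) <= tol
        for a, b in product(p_i, p_j)
    )


# Hypothetical joint tables: the first factorizes, the second does not.
indep_joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}
dep_joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(independent(indep_joint), independent(dep_joint))
```

With an exact joint distribution the equality can be checked up to rounding error; with sample data a statistical test would be needed instead.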

Although, as mentioned above, this equality is not always satisfied with sample
data, in this paper we assume that such independence inferences are available. We
also assume that we are provided with this information whenever it is requested. The
second step is as follows: after inferring all the statistical independencies between the
pairs of variables, we use the following Orienting Principle (T) due to Pearl et al. [14]
to completely orient the tree.

Orienting Principle (T):

For every unoriented triplet of variables X, Y and Z ordered as X - Z - Y,
we test for the independence of X and Y. If X and Y are independent, then X is
a parent of Z and Y is a parent of Z. For any triplet X, Y and Z such that
X -> Z - Y, we test if X and Y are independent; if this is so, Y is a parent of Z,
otherwise Y is a child of Z.
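
The principle can be read as a small decision rule on one triplet. The sketch below is only an illustration: the function `orient_triplet`, the flag `x_is_parent` and the toy independence oracle are hypothetical names introduced here, and arcs are written as (tail, head) pairs.

```python
def orient_triplet(x, z, y, indep, x_is_parent=False):
    """Apply the orienting principle to neighbours x and y of z.

    indep(a, b) is an assumed oracle returning True when a and b are
    marginally independent. Returns the set of arcs (tail, head) that
    the principle allows us to fix for this triplet.
    """
    if indep(x, y):
        # Independent neighbours are both parents of z.
        return {(x, z), (y, z)}
    if x_is_parent:
        # x -> z is known and y depends on x, so y is a child of z.
        return {(x, z), (z, y)}
    # Unoriented triplet with dependent ends: nothing can be decided yet.
    return set()


# Toy oracle: only A and B are marginally independent.
def oracle(a, b):
    return frozenset((a, b)) == frozenset(("A", "B"))


print(orient_triplet("A", "C", "B", oracle))
print(orient_triplet("A", "C", "D", oracle, x_is_parent=True))
```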

The details of why this principle works are omitted here but explained in greater detail
in [10] and [13]. Utilizing all this information, we shall show that the polytree can be
efficiently computed if the underlying tree structure is systematically traversed.


2.  A  Depth  First  Search  Algorithm  for  Building  Polytrees 

Our  algorithm  for  inducing  the  polytree  is  an  application  of  the  Depth-First  Search 
(DFS)  algorithm  to  causal  components  of  the  undirected  tree.  Let  T  =  (V,  E)  be  a 
connected,  undirected  tree  where  V  is  the  set  of  vertices,  and  E  the  set  of  edges. 

A vertex Z is said to be an articulation point between vertices X and Y if X and Y are
neighbours of Z and we have independence between X and Y. As defined by Pearl [14] and
used by all the researchers since, a Causal Basin starts with a multi-parent cluster (a child node and all
of its direct parents) and continues in the direction of causal flow to include all of the
child’s descendants and all of the direct parents of those descendants.

An example of this is given in Figure 1.


Fig.  1.  A  Causal  Basin  as  defined  by  Pearl  [14]. 
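
To make the definition concrete, the following Python sketch collects a causal basin from an already oriented polytree. The representation (a dictionary mapping each node to its set of direct parents) and the small example network are assumptions for illustration; the example is not the network of Fig. 1.

```python
def causal_basin(parents, child):
    """Causal basin rooted at a multi-parent node `child`.

    parents maps each node to the set of its direct parents. The basin
    holds the child, its direct parents, all descendants of the child,
    and the direct parents of those descendants.
    """
    # Invert the parent map to follow the direction of causal flow.
    children = {}
    for node, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(node)

    basin = {child} | set(parents.get(child, ()))
    stack, seen = [child], {child}
    while stack:
        node = stack.pop()
        for c in children.get(node, ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
                basin.add(c)
                basin |= set(parents.get(c, ()))
    return basin


# Hypothetical polytree: A -> C <- B, C -> D, C -> E <- F, F -> G.
example_parents = {"C": {"A", "B"}, "D": {"C"}, "E": {"C", "F"}, "G": {"F"}}
print(sorted(causal_basin(example_parents, "C")))
```

In this example G stays outside the basin: it is a child of F, and F enters the basin only as a direct parent of the descendant E.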


2.1  Problems  with  Pearl’s  Algorithm 

Although  the  above  definition  is  consistent,  there  are  some  unanswered  questions 
which  arise  from  the  work  of  Rebane  and  Pearl  [15].  In  fact,  although  they  specify  a 
formal  algorithm  to  compute  the  causal  basins,  they  leave  the  following  questions 
unanswered: 

1. The question of what is meant by the outermost layer is not clear since it
“depends on the tree” and its representation.

2.  The  question  of  how  the  traversal  is  done  is  not  completely  defined. 

3. The algorithm introduces ambiguity regarding the edges that are already
traversed.

4.  The  notion  of  causal  basins  depends  on  the  starting  point. 

The last of these issues can be seen from the following figure, in which the Chow tree
of the polytree is taken from [14].

Fig. 2. Three Causal Basins starting respectively at nodes H, K and C.


Observe that the Chow tree of Fig. 1 and Fig. 2 is the same. In Fig. 1, the first articulation
point (starting point) is node C and the second articulation point is node K.
Having  this  order  for  the  choice  of  the  starting  points  we  detect  two  causal  basins  as 
given  in  Fig.  1.  In  Fig.  2,  we  use  node  H  as  the  starting  point,  node  C  as  the  second 
and  node  K  as  the  third  to  be  able  to  complete  the  same  orientation  of  the  Chow  tree  as 
in  Fig.  1  using  three  causal  basins  instead  of  two.  From  this  it  is  easy  to  see  that  the 
starting  point  determines  the  individual  causal  basins. 

2.2  Motivation  for  a  DFS  Strategy 

Consider the process of visiting the vertices of an undirected tree in the following
manner. We select and “visit” a starting vertex Z which is one of the articulation
points in T and, in particular, an articulation point between two nodes X and Y. First of
all, we orient the edges (X, Z) and (Y, Z) as pointing to node Z, following the orienting
principle, since we have independence between X and Y. Then we select any edge (Z, W)
incident upon Z. We check for independence between nodes X and W to determine the
orientation of edge (Z, W). We observe two possible scenarios. If there is no independence
between X and W, then edge (Z, W) is pointing to node W. We then visit
node W and begin to search for a new edge starting at vertex W. After completing the
search through all causal paths beginning at W, the search returns either to Z, the vertex
from which W was first reached, to search through all nodes in the adjacency list
of Z, or to another non-visited articulation point. If there is independence between X
and W, the edge (Z, W) is pointing to node Z, and the search returns either to Z, the
vertex from which W was first reached, to search through all the nodes in the adjacency
list of Z, or to another non-visited articulation point.

The  process  of  selecting  unexplored  edges  incident  on  Z  is  continued  until  this  list 
is  exhausted.  This  is  formalized  in  the  algorithm  Polytree-Depth-First-Search. 

The input to the algorithm is mainly the set of nodes, and for every node X_i, we
specify its “adjacency” list, which consists of a list of nodes X_j such that the edge
X_i - X_j exists in the underlying tree. Also provided are the independence
tests between nodes whenever required. The algorithm is formally given below.

Algorithm Polytree-Depth-First-Search
Input:  A tree T = (V, E).
        An independence test, available for every pair of nodes when it is required.
        For every node, a list of all its direct neighbours, specified as a
        connected list, ConList. We assume that the test for a node being an
        articulation point is a straightforward operation. Also, a node W is in the
        causal basin of X if there is a path from X to W.
Output: A directed polytree if the orientation exists. It returns T, the undirected tree,
        if no orientation is possible.
Method
Begin
    For (all X in V) Do
        Visited[X] = false    /* Visited is an array indexed by the nodes */
    EndFor
    For (all X in V) Do
        /* always start with an articulation point */
        If (!Visited[X] and X is an articulation point) Then
            Call Processing(X)
        EndIf
    EndFor
End Algorithm Polytree-Depth-First-Search


Procedure Processing(X)
Begin
    /* node X is not a leaf */
    If ((Visited[X] = false) and (|ConList(X)| > 1)) Then
        /* orient the adjacent edges */
        Call IndepOrient(X)
        Visited[X] = true
        /* traverse the adjacency list of X */
        For (all W in the ConList of X such that W is in the causal basin of X) Do
            /* Processing is recursive because of Depth-First Search */
            Call Processing(W)
        EndFor
    EndIf
End Procedure Processing

The above algorithm uses a DFS strategy. It is easy to devise an analogous algorithm
which uses a Breadth-First Search strategy, or any systematic search scheme.

Procedure IndepOrient(X)
Begin
    For (every distinct N1 and N2 in ConList(X)) Do
        If Indep(N1, N2) = True Then
            print arcs from (N1 to X) and (N2 to X)
        EndIf
    EndFor
    For (every distinct N1 and N2 in ConList(X)) Do
        If (arc from N1 to X exists and edge from N2 to X is unoriented) Then
            print arc from X to N2
        Else If (arc from N2 to X exists and edge from N1 to X is unoriented) Then
            print arc from X to N1
        EndIf
    EndFor
End Procedure IndepOrient
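
For readers who prefer executable code, the pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `orient_polytree`, `adj` and `indep` are hypothetical, the independence oracle is assumed to be perfect, arcs are collected in a set of (tail, head) pairs instead of being printed, and "W is in the causal basin of X" is read as "the arc X -> W has been oriented".

```python
from itertools import combinations


def orient_polytree(adj, indep):
    """Orient a skeleton tree given as adjacency lists, using oracle indep(a, b)."""
    arcs = set()      # oriented arcs as (tail, head)
    visited = set()

    def is_articulation(x):
        # x has at least one pair of marginally independent neighbours.
        return any(indep(a, b) for a, b in combinations(adj[x], 2))

    def indep_orient(x):
        # First pass: independent neighbours are both parents of x.
        for a, b in combinations(adj[x], 2):
            if indep(a, b):
                arcs.update({(a, x), (b, x)})
        # Second pass: once x has a parent, unoriented neighbours are children.
        if any((n, x) in arcs for n in adj[x]):
            for n in adj[x]:
                if (n, x) not in arcs and (x, n) not in arcs:
                    arcs.add((x, n))

    def processing(x):
        if x in visited or len(adj[x]) <= 1:   # skip visited nodes and leaves
            return
        indep_orient(x)
        visited.add(x)
        for w in adj[x]:
            if (x, w) in arcs:                 # follow the causal flow only
                processing(w)

    for x in adj:
        if x not in visited and is_articulation(x):
            processing(x)
    return arcs


# Hypothetical polytree A -> C <- B, C -> D, C -> E <- F and its skeleton.
adj = {"A": ["C"], "B": ["C"], "C": ["A", "B", "D", "E"],
       "D": ["C"], "E": ["C", "F"], "F": ["E"]}
INDEPENDENT = {frozenset(p) for p in
               [("A", "B"), ("A", "F"), ("B", "F"), ("C", "F"), ("D", "F")]}


def indep(a, b):
    return frozenset((a, b)) in INDEPENDENT


print(sorted(orient_polytree(adj, indep)))
```

An edge that lies outside every causal basin (for example, an extra leaf attached to F above) would remain unoriented in this sketch, since no independence test distinguishes its two directions.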


2.3  Analytic  Properties  of  the  Algorithm 

The  formal  proof  that  the  above  algorithm  works  follows  the  arguments  of  the  DFS 
traversal  of  a  graph. 

Theorem  1: 

The  algorithm  Polytree-Depth-First-Search  correctly  computes  the  polytree  given  the 
skeleton  tree  structure  and  the  underlying  independence  relationships. 

Proof:  The  proof  is  done  inductively  and  found  in  [10].  ♦  ♦  ♦ 

We  shall  now  state  results  regarding  the  complexity  of  the  above  algorithm.  This  is 
done  by  considering  the  number  of  independence  tests  that  need  to  be  performed  to  be 
able  to  orient  the  tree. 

Theorem  2: 

For a tree in which every node has up to k adjacent nodes, at depth d in the tree, the
total number of independence tests which have to be done is: k * (k - 1)^(d-1).

Proof:  The  proof  is  not  too  involved,  but  follows  from  an  induction  on  the  size  of  the 
DFS  tree.  It  is  omitted  here,  but  found  in  the  unabridged  paper  [10].  ♦  ♦  ♦ 

From  a  straightforward  examination  the  overall  burden  of  the  computation  can  be 
obtained  by  observing  that  for  each  node  we  have  to  do  pairwise  independence  tests 
between  its  neighbors.  Theorem  2  explains  what  happens  at  every  level  of  the  tree. 


3. Experimental Results

In order to check the algorithm developed in this paper, we have done numerous experiments.
We assumed that we were to learn an underlying polytree, which is unknown
to the algorithm. Also, we assumed that independence tests, which are consistent
with the polytree orientation, were available whenever they were needed by the
algorithm. The polytree learning algorithm applicable for discrete data (called
Discrete-Polytree-Depth-First-Search) was invoked using the skeleton and the sequence
of independence tests.

The  entire  project  involved  the  testing  of  the  algorithm  for  numerous  trees  and 
polytrees.  The  details  of  the  results  are  omitted  here  but  included  in  [10],  [13].  The 
descriptions  in  [10],  [13]  include  specific  polytrees,  and  clearly  demonstrate  how  the 
structure  is  learnt  as  the  independence  tests  are  provided.  The  test  cases  also  show 
examples  in  which  the  polytrees  have  one  or  multiple  causal  basins.  In  every  case,  the 
polytree  was  exactly  inferred. 

Our experimental results consistently demonstrate that the algorithm successfully
orients the polytree. However, if the independence information for any pair of nodes is
not provided to the algorithm, the algorithm will terminate without orienting the tree.
The advantages of our algorithm are numerous. First of all, it is computationally efficient
since it uses a DFS scheme. It also extends to both discrete and continuous
variables, and provides a very efficient way of traversing the tree. Finally, we mention
that the scheme also handles multi-feature variables. It has been used quite successfully
in a real-life application where the problem is to improve performance in systems
using repeated queries which access distributed databases [13]. The scheme has also
been used for the ALARM data [7].


4.  Conclusion 

In this paper we have considered the problem of approximating an underlying distribution
by one derived from a dependence polytree. The skeletal form of the polytree is
known to be the MWST of a complete graph, with I(X_i, X_j), the information theoretic
metric, as the edge weight between the pair of nodes X_i and X_j. Once the tree is derived,
Rebane and Pearl, in [15], proposed to use an independence test to determine if
a variable has multiple parents. They dictated that every two neighbors X and Y
of a node Z must be tested for marginal independence to decide if Z has parents X and
Y.

This paper proposes a formal and systematic algorithm to traverse the tree
obtained by the Chow method. It uses an application of the DFS strategy to multiple
causal basins. Experimental results clearly demonstrate that when the required independence
tests are available to the algorithm, the orientation of the polytree is completed,
and always correct. The algorithm has also been used in two real-life applications
[13] involving distributed databases and the ALARM data.


Abstract. This article shows that cartographic generalization is best
viewed as representing (formulating, renaming knowledge) and abstracting
(simplifying a given representation). The general process of creating a map
is described so as to show how it fits into an abstraction framework
developed in artificial intelligence to emphasize the difference between
abstraction and representation. The utility of the framework lies in its
capacity to automate knowledge acquisition for cartographic
generalization as a combined acquisition of knowledge for abstraction and
knowledge for changing a representation.


1  Introduction 

In this paper we address the problem of automating cartographic generalization. This
automation is needed for several reasons: first, to decrease the cost and time necessary to
produce maps; then, to allow geography experts who are not necessarily cartography
specialists to create their own maps of good quality; and finally, to meet the
crucial need for multi-level analysis of geographic data.

The lack of efficient generalization tools in GIS is due to the fact that
generalization is a difficult task: it is guided by a lot of geographic and cartographic
knowledge. One approach to meeting this need for automation is to build expert systems,
which have proved to be efficient in numerous fields where knowledge needs to be
introduced. Many authors emphasize that the main problem for the use of expert
systems is the «knowledge acquisition bottleneck».

Moreover, the analysis of first-generation expert systems 0 stresses the need to
differentiate, separate, and structure the different types of knowledge in second-generation
expert systems. We present in this article a description of the knowledge
used in cartographic generalization that is well fitted to its acquisition. We analyze
generalization along two dimensions: knowledge abstraction and knowledge
representation, as proposed by 0. This distinction is necessary to differentiate, and
so acquire, the different knowledge types involved in generalization.

2  Differentiating  Representation  and  Abstraction 

Representing knowledge has been one of the main research topics in Artificial Intelligence
since its birth. The AI community has come up in the past fifty years with a large
variety of languages that are each more or less adapted to represent the different fields of
human knowledge that need to be represented and processed 0. Although a large
amount of human expertise can be formulated as a set of specific procedures or
inferences in one given language or paradigm, the cartographic generalization
process clearly requires several knowledge representation languages to capture the
different types of knowledge manipulated, ranging from the raw data from the world
to their final representation as a usable map.

Saitta  and  Zucker  have  recently  proposed  a  model  of  abstraction  (hereafter  called 
the  KRA  model),  supporting  reasoning  in  a  wide  context  0.  They  distinguish  two 
fundamental  processes,  namely  the  process  of  changing  the  language  of 
representation  and  the  process  of  abstracting  the  language  of  representation. 

The  KRA  model  originates  from  the  observation  that  the  conceptualization  of  a 
domain  involves  four  different  levels.  Underlying  any  source  of  experience  there  is 
the  world  (W),  where  concrete  objects  reside.  However,  the  world  is  not  really 
known,  because  we  only  have  a  mediated  access  to  it,  through  our  perception.  Then, 
what  is  important  for  an  observer  is  not  the  world  in  se,  but  the  perception  P(W)  that 
s/he  has  of  it.  At  this  level  the  percepts  «exist»  only  for  the  observer  and  only  during 
their  being  perceived.  Their  reality  consists  in  the  «physical»  stimuli  produced  on  the 
observer.  In  order  to  let  these  stimuli  become  available  over  time,  for  retrieval  and 
further  reasoning,  they  must  be  first  of  all  memorized  and  organized  into  a  structure 
S.  This  structure  is  an  extensional  representation  of  the  perceived  world,  in  which 
stimuli  related  one  to  another  are  stored  together  into  tables.  The  set  of  these  tables 
constitutes  a  relational  database,  on  which  relational  algebra  operators  can  be 
applied. Next, in order to symbolically describe the perceived world, and to
communicate with other agents, a language L is needed. L allows the perceived
world to be described intensionally. Finally, a theory T might be needed to reason
about the world. The theory may also contain general knowledge, which does not
belong to the specific domain, and allows inferences to be drawn. At the theory level
we operate through inference rules. Let us define R = < P(W), S, L, T > as a
Reasoning Context. The relationships among the four considered levels are
represented in Figure 1.

There  is  an  infinity  of  ways  in  which  the  world  can  be  perceived  by  an  intelligent 
agent,  according  to  the  observation  tools,  the  goal  of  the  observation,  the  agent’s 
cultural  background,  and  so  on.  This  variability  is  captured  by  the  diversity  of  the 
world  perceptions  P(W).  It  is  at  this  layer  that  the  type  and  amount  of  information 
the  agent  will  memorize,  speak  about,  and  reason  about  later  is  established.  The  less 
detailed the perception, the more abstract. Sometimes the agent has control over the
perception in such a way as to collect exactly the information it needs to achieve its
goals. Sometimes, the agent cannot control the perception, so that it may receive
much  more  information  than  it  currently  needs,  or  maybe  it  wants  to  perform  several 
tasks,  each  one  requiring  different  pieces  of  information,  which,  on  the  other  hand, 
are  easy  to  collect  together.  The  preceding  considerations  suggest  that  it  would  be 
very  useful  to  have  methods  to  actually  or  virtually  transform  a  perception  into  a 
more  abstract  one.  The  following  definition  of  abstraction  tries  to  capture  this 
process. 


Figure 2 - Changing levels of details


Definition - Given a world W, let R_g = (P_g(W), S_g, L_g, T_g) and R_a = (P_a(W),
S_a, L_a, T_a) be two reasoning contexts, which we label as ground and abstract. An
Abstraction is a functional mapping A : P_g(W) -> P_a(W) between a perception
P_g(W) and a simpler perception P_a(W) of the same world W.

Some comments are needed about this definition. In 0 a formal definition of
“simpler” in terms of relative information gain has been given. Obviously, the
process of abstracting a perception can be iterated, leading to several levels of
abstraction. If no perception can be identified as a preeminent one, then any level can
be selected as “ground”, the notion of “simpler” being only a relative one. Another
important point is that abstraction should be a reversible process; in fact, to abstract
does not mean to delete information, but only to hide information, in such a way
that the opposite process (concretion) becomes possible as well. Finally, according
to this definition, the abstraction process starts at the perception level, but propagates
toward the other layers of Figure 1. However, the abstraction relations between the
structures, the languages and the theories are determined by the relations defined on
the perceptions.

In Figure 2, the view on abstraction presented in this paper is synthetically described.
The symbols ω, σ, λ and τ denote abstraction operators working between entities of
the same layer.

The perceptual stimuli of the perceived world are classified according to
categories that have proved useful to organize information (including ontologies,
objects, attributes, functions, and relations). Within the KRA model framework 0, a
set of fundamental abstraction operators has been defined. These operators are
defined at the perceived world P(W) level. The proposed set of fundamental
operators Ω = {ω_hide, ω_ind, ω_set, ω_ta, ω_{eq-val}, ω_{red-arg}, ω_prop} is not
exhaustive. In particular contexts (such as cartography), it can be either reduced or
augmented with domain-specific abstraction types.

The operator ω_hide is a fundamental abstraction operator that consists in hiding any
kind of knowledge, be it an object, an attribute of an object or a relation between
objects. For example, an isolated street may be hidden. The operator ω_ind consists in
making several objects indistinguishable. For example, only one of several close
isolated trees will be considered as a typical representative of them. The operator ω_set
consists in grouping several objects that are considered not to be distinguished, for
example the grouping of a set of trees into an object "forest". The operator ω_ta
consists in grouping a set of different objects to form a new compound object, for
example grouping several streets and buildings to form a town. The operator ω_{eq-val}
specifies what subset of the attribute or function values can be merged, because they
are considered indistinguishable. For example, two objects with close altitudes will
be considered to be at the same altitude. Finally, the operator ω_{red-arg} specifies a relation
and a subset of its arguments, which must be dropped from the relation, thus obtaining
a relation with reduced arity. For example, the argument “type of crossing” may
be hidden in the relation between two roads. From these operators defined at the
perception level, the operators between the structures, languages and theories are
deduced.
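
To give a feel for how such operators act on a perception, the following Python sketch applies hiding, grouping and value-merging to a toy set of geographic objects. Everything in it is an assumption made for illustration: the dictionary representation, the function names `hide`, `group` and `merge_values`, and the rounding rule used to decide that two values are close. It is not the formal definition of the operators in the KRA model.

```python
def hide(objects, name):
    """Hiding: return a perception in which one object is no longer visible."""
    return {k: v for k, v in objects.items() if k != name}


def group(objects, members, new_name, kind):
    """Grouping: replace several objects by one compound object.

    The names of the parts are kept, so the ground objects can be recovered.
    """
    kept = {k: v for k, v in objects.items() if k not in members}
    kept[new_name] = {"kind": kind, "parts": sorted(members)}
    return kept


def merge_values(objects, attr, step):
    """Value merging: treat close values of attr as equal by rounding to step."""
    merged = {}
    for name, props in objects.items():
        props = dict(props)
        if attr in props:
            props[attr] = round(props[attr] / step) * step
        merged[name] = props
    return merged


# Hypothetical ground perception.
ground = {
    "tree1": {"kind": "tree", "altitude": 101},
    "tree2": {"kind": "tree", "altitude": 99},
    "street9": {"kind": "street", "altitude": 100},
}

print(sorted(hide(ground, "street9")))
print(group(ground, {"tree1", "tree2"}, "forest1", "forest")["forest1"])
print({k: v["altitude"] for k, v in merge_values(ground, "altitude", 10).items()})
```

Each function returns a new dictionary and leaves the ground perception untouched, which mirrors the remark that abstraction hides information and does not delete it.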

3  Cartography  in  the  KRA  Model 

The KRA model exhibits several key properties for cartography. It allows one to
distinguish the process of representation (change of language) from the process of
abstraction (change of level of detail). These two processes are usually very much
entangled  in  cartography.  This  distinction  provides  the  basis  for  automating 
knowledge  acquisition  in  cartography,  as  a  combined  acquisition  of  knowledge  for 
abstraction  and  knowledge  for  changing  representation,  as  we  will  explain  in  the 
following  section. 

The  topographic  map  production  process  closely  parallels  the  KRA  model,  because  it 
can  be  analyzed  according  to  the  two  dimensions,  representation  and  abstraction.  Let 
us  first  consider  the  scheme  of  Figure  1  applied  to  cartography.  The  first  step  of 
cartography  is  to  collect  data  from  the  geographic  world,  or  part  of  it  (W).  This  is 
usually  done  through  aerial  photographs  or  satellite  images.  These  data  are  the 
perceived  world  P(W).  Objects  contained  in  these  photographs  are  located  and 
labeled to create a geographic database (GDB). This GDB is the set of geographic
data organized in a Structure (S). Then, this GDB is displayed by means of
cartographic symbols applied to objects stored in it. This is the creation of a map, an
iconic language (L). Finally, maps are created for tasks like space analysis, search
for itineraries, town planning, or geographic theory construction. The theory T
contains all the background facts and laws that allow reasoning about geographic
configurations, and may be different for different tasks.

Cartography is not just knowledge representation. Every step of map creation
involves not only knowledge representation, but also knowledge
abstraction. In particular, map creation (which contains the generalization process)
is both a knowledge representation process, when objects are symbolized, and a
knowledge abstraction process, when objects relevant to the theory construction are
identified. So described, map creation is represented as a diagonal process in Fig. 3.


Generalization  Process  in  the  KRA  Model 

Knowledge abstraction in generalization is the identification of abstracted
geographic objects relevant to the theory construction that will be done from the
map. "Objects" have to be taken here in a very wide sense: they may represent any
basic geographic objects (like a house, a road...) or any set of basic objects having a
geographical meaning (the set of streets of a town, a street and the buildings along
it...).

In our model, the abstraction process is to go from a detailed description of a
geographic object, describing each part of the object, to a more abstract description
of the object, describing only properties of the object relevant to the map users'
needs. For example, an abstraction is to go from a complete description of a set of
streets in a town to the description «this is a street network».

As we explained in the KRA model presentation, abstractions at the structure level
(i.e. on objects of the geographic database) shall only be considered as consequences of
abstractions at the geographic world perception level.

Knowledge representation in generalization is the process of symbolizing
abstracted objects. For example, this representation process is to determine which
symbolized subset of streets is best suited to represent «a street
network». This choice is guided by the necessity to represent the abstracted
object well and restricted by the drawing possibilities (we cannot represent all the
symbolized streets because they would overlap one another on the paper).

Difference  but  not  independence.  It  is  important  to  notice  that  knowledge 
abstraction  and  representation  can  not  be  performed  independently  one  from  the 
other,  nor  that  when  abstraction  has  been  done  the  “ground”  GDB  is  no  more 
necessary.  For  example,  in  a  street  network  the  drawn  streets  are  a  subset  of  the 
“ground”  streets.  The  abstracted  object  «street  network»  helped  us  to  change  our 
view  of  the  world,  but  the  representation  process  needs  to  look  again  at  the  ground 
GDB to represent actual objects of the world. In this way we imitate human
perception, which continuously changes the level of abstraction in order to analyze space well.
These  inter-links  between  abstraction  and  representation  explain  why,  manually, 
these two steps have always been performed in a single pass by the cartographer.
However, it has been shown that efficient expert systems need to clearly separate the
different types of knowledge involved in human processes 0. We therefore believe that this
distinction  between  abstraction  and  representation  is  necessary  for  the  creation  of  an 
automated  process  of  cartographic  generalization. 


Figure  3  -  Representation  and  abstraction  in  cartography 

Our  model  is  related  to  Brassel  and  Weibel’s  famous  model  of  cartographic 
generalization  0,  in  the  sense  that  it  differentiates  the  drawing  process  from  other 
processes. The main difference, however, comes from the knowledge abstraction step
which, in our model, is far from being a simple «statistical generalization». Moreover, we
think  that  this  step  is  the  most  complex  step  of  cartographic  generalization,  even  if 
the  representation  process  is  still  far  from  being  well  mastered  in  the  field  of 
automatic  cartographic  generalization. 

For  these  reasons,  our  model  is  closely  related  to  Nyerges’  view  of  cartographic 
generalization 0, which splits generalization into two phases: «Geographical
information abstraction mainly concerns managing geographical meaning in
databases, and map generalization mainly concerns structuring map presentations».


4  Structuring  Knowledge  for  Its  Acquisition 

The distinction drawn between abstraction and representation is necessary for
efficient knowledge acquisition in cartographic generalization. Because of space
limitations, we do not develop this section in detail; we only list the different types of
knowledge involved in these two processes.
On the one hand, the knowledge abstraction process is seen as the identification
of geographic objects that are relevant to the user's needs (and need to be drawn). It
manipulates (1) geographic knowledge (e.g. a town model), which could be acquired
through interviews with geographers; (2) perception knowledge (Gestalt), which has been
widely studied in psychology and cartography; and (3) space analysis knowledge, which should
be acquired computationally through space analysis tools (e.g. Delaunay triangulation).
On the other hand, the knowledge representation process manipulates

1. graphic knowledge, to define when a map is legible, which is well known by
cartographers

2. drawing knowledge, to define how to represent an abstracted object, which
could be acquired through interviews with cartographers or the analysis of manually
drawn maps

3. algorithm knowledge, which could be acquired by applying machine learning
techniques to the analysis of cartographers' drawings.


5  Conclusion 

One  of  the  key  problems  limiting  the  automation  of  cartographic  knowledge 
acquisition  lies  in  the  heterogeneity  of  the  knowledge  that  is  involved  throughout  the 
process  of  creating  maps  from  geographic  databases.  In  this  article,  we  have  adapted 
an  Artificial  Intelligence  model  to  distinguish  two  fundamental  transformations  used 
in  the  field  of  cartography,  namely  abstraction  and  representation  of  knowledge. 
The  first  contribution  of  this  work  is  therefore  to  propose  a  classification  of 
cartographic  and  geographic  knowledge  along  this  dimension  in  order  to  facilitate  its 
acquisition. 